Abstract


Grouped jackknifing may be used to evaluate the stability of equating procedures with respect to 
sampling error and with respect to changes in anchor selection. Properties of grouped jackknifing 
are reviewed for simple-random and stratified sampling, and its use is described for comparisons 
of anchor sets. Application is made to examples of item response theory (IRT) true-score equating 
in which two-parameter logistic and general partial credit models are employed. 
Equating of test forms involves sampling of examinees, so that random equating errors are
introduced through estimation of equating parameters. When anchor items are employed in
equating and when classical equating assumptions apply, the choice of anchor items should have
minimal effect on the equating process. In practice, however, the choice of anchor items need not
have a minimal effect. To evaluate variability in equating due to sampling error
and variability due to selection of anchor items, jackknifing may be employed. This
report illustrates the use of jackknifing in the case of IRT true-score equating, but jackknifing may be
employed with other approaches as well.

Jackknifing is a commonly employed statistical technique for estimation of variances of sample 
statistics (Quenouille, 1956; Tukey, 1958; Miller, 1964). It may be employed to obtain approximate 
confidence intervals for population measures of interest. Applications of jackknifing commonly 
involve cases in which it is difficult to apply the delta method (Rao, 1973, p. 388) to estimate
variances. Given the large number of steps involved in IRT true-score equating, the delta method is
challenging to apply; however, the grouped jackknifing approach (Miller, 1964) is readily used 
to study sampling errors associated with conversions of test scores. Grouped jackknifing is an 
example of a resampling method because it employs estimates based on selected subsamples of 
the observed data. It requires much less computational labor than other resampling methods such 
as bootstrapping methods (Efron, 1979, 1982), traditional jackknifing (Quenouille, 1956; Tukey, 
1958), or delete-d versions of the jackknife in which d > 1 (Shao & Wu, 1989). 

Jackknifing may also be employed to examine the stability of IRT true-score equating with 
respect to the choice of anchor items. This stability can be examined in two distinct fashions. 

In one case, the effect of a specified change in the anchor set can be studied by examination of 
the estimated means and standard deviations of the differences between the resulting conversions. 
In another case, anchor items can be regarded as a sample from a collection of possible anchor 
items. One then examines both the variability of conversions due to sampling of examinees 
and the variability of conversions due to selection of anchor items. This latter possibility has 
been considered previously (Cohen, Johnson, & Angeles, 2001); however, this application of 
jackknifing requires additional study to justify its use. In addition, within the context of equating, 
consideration must also be given to the nature of sampling in the case of items. In typical cases, 
testing programs do not randomly select items, so that inferences may be problematic beyond the 
anchor items present in the forms under study. This issue will be discussed further in section 4. 
Section 1 provides necessary background concerning the grouped jackknife. Section 2 provides
background concerning IRT true-score equating. In section 3, jackknifing is applied to assess 
variability of conversions in two cases in which two forms of a test are linked by IRT true-score 
equating. Section 4 provides some general observations concerning application of jackknifing to 
the study of equating. 


1 The Traditional and the Grouped Jackknife 

The grouped jackknife is an early example of a resampling method (Efron, 1979, 1982). It is
primarily of interest when computational cost is a major issue. To explain grouped jackknifing, 
it is helpful to begin with elementary methods to estimate standard errors and obtain confidence 
intervals for the population mean and population standard deviation. These examples lead 
to some simple illustrations of traditional delete-1 jackknifing procedures in which a series of 
estimates are computed by removing one observation from the sample. The analysis of traditional 
delete-1 jackknifing then leads to grouped jackknifing in which the observations are divided into 
groups and estimates are computed by leaving out one group from the sample. 

In the discussion of delete-1 jackknifing, the sample mean of independent and identically
distributed random variables has a fundamental role. One basic justification of delete-1 jackknifing 
is the fact that it results in customary inferences concerning the population mean when the sample 
mean is employed. General justification of delete-1 jackknifing involves a demonstration that the 
parameter estimates under study are well approximated by sample means. Such approximations 
are typically available when parameter estimates are differentiable functions of sample means. 

To begin, consider the sample mean of the real observations $X_i$, $1 \le i \le n$, $n > 1$, obtained by
random sampling with replacement. For example, the $X_i$ might be raw scores of examinees for a
particular test administration, where the examinees are regarded as a sample from a hypothetical
infinite population of potential examinees. Let the $X_i$ be random variables with common mean
$\mu$ and common variance $\tau^2 > 0$. The assumption of random sampling with replacement implies
that the $X_i$ are independent and identically distributed. Consider the elementary problem of
estimation of the expectation $\mu$ by the sample mean

$$\bar X = n^{-1}\sum_{i=1}^{n} X_i.$$

As is well known, $\bar X$ has expectation $\mu$ and variance $\sigma^2(\bar X) = \tau^2/n$. Thus $\bar X$ is an unbiased
estimate of $\mu$. In addition, the variance $\sigma^2(\bar X)$ has a simple unbiased estimate, for the sample
variance

$$s^2 = (n-1)^{-1}\sum_{i=1}^{n}(X_i - \bar X)^2$$

has expectation $\tau^2$, and $\hat\sigma^2(\bar X) = s^2/n$, the estimated variance of the sample mean, is an
unbiased estimate of $\sigma^2(\bar X) = \tau^2/n$, the variance of the sample mean. In addition, the ratio
$\hat\sigma^2(\bar X)/\sigma^2(\bar X)$ converges to 1 with probability 1 as the sample size $n$ becomes large (Shao, 2003,
p. 133). The estimated standard error $\hat\sigma(\bar X)$ of the sample mean is the square root of the
estimated variance $\hat\sigma^2(\bar X)$ of the sample mean. When the sample size $n$ is large, $(\bar X - \mu)/\hat\sigma(\bar X)$
has an approximate standard normal distribution, so that approximate confidence intervals for
$\mu$ are readily constructed (Scheffe, 1959, p. 355). For any real $\alpha$ such that $0 < \alpha < 1$ and any
positive integer $\nu$, let $t_{\nu,\alpha}$ be defined so that $\alpha$ is the probability that a random variable with a $t$
distribution on $\nu$ degrees of freedom has absolute value at least as large as $t_{\nu,\alpha}$. In addition, let $z_\alpha$
be defined so that $\alpha$ is the probability that a random variable with a standard normal distribution
has absolute value at least as large as $z_\alpha$. Then the customary approximate two-sided confidence
interval for $\mu$ of level $1 - \alpha$ has lower bound

$$\mu_{L\alpha} = \bar X - t_{n-1,\alpha}\,\hat\sigma(\bar X)$$

and upper bound

$$\mu_{U\alpha} = \bar X + t_{n-1,\alpha}\,\hat\sigma(\bar X).$$

As the sample size $n$ increases, the probability approaches $1 - \alpha$ that $\mu_{L\alpha} < \mu < \mu_{U\alpha}$. In
the special case in which the $X_i$ have a common normal distribution, $(\bar X - \mu)/\hat\sigma(\bar X)$ has a $t$
distribution on $n - 1$ degrees of freedom, and $(n-1)s^2/\tau^2$ has a chi-squared distribution on $n - 1$
degrees of freedom, so that $1 - \alpha$ is the exact probability that $\mu_{L\alpha} < \mu < \mu_{U\alpha}$.
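As a concrete illustration of the interval above, the following is a minimal Python sketch. It assumes the caller supplies the multiplier $t_{n-1,\alpha}$ (for example from a table), because the Python standard library has no $t$ quantile function; the name `mean_confidence_interval` is illustrative and not taken from the report.

```python
import math
from statistics import mean, stdev


def mean_confidence_interval(xs, t_mult):
    """Return (mu_L, mu_U) = Xbar -/+ t_mult * sigma_hat(Xbar).

    sigma_hat^2(Xbar) = s^2 / n, where s^2 uses the n - 1 divisor.
    t_mult is t_{n-1,alpha}, supplied by the caller.
    """
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    xbar = mean(xs)
    se = stdev(xs) / math.sqrt(n)  # estimated standard error of the mean
    return xbar - t_mult * se, xbar + t_mult * se


# Five observations; t_{4,0.05} is about 2.7764 (two-sided level 0.95).
print(mean_confidence_interval([1.0, 2.0, 3.0, 4.0, 5.0], 2.7764))
```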

In the discussion of grouped jackknifing and traditional delete-1 jackknifing, results are compared
with those obtained by traditional confidence intervals for the population mean.
One aspect of this comparison involves expected widths of confidence intervals. These expected
widths are not difficult to study in the case of the approximate confidence intervals for the
population mean. For a large sample size $n$, the multiplier $t_{n-1,\alpha}$ is close to $z_\alpha$. Even for $n = 120$
and $\alpha = 0.05$, $t_{n-1,\alpha} = 1.9801$ and $z_\alpha = 1.9600$. In general, $t_{\nu,\alpha}$ is quite well approximated by
$z_\alpha + (z_\alpha^3 + z_\alpha)/(4\nu)$ as $\nu$ increases (Abramowitz & Stegun, 1965, p. 949). For example, for $\nu = 119$
and $\alpha = 0.05$,

$$z_\alpha + (z_\alpha^3 + z_\alpha)/(4\nu) = 1.9799$$

is quite close to $t_{n-1,\alpha} = 1.9801$. In the case of the $X_i$ normally distributed, the expected
width $E(\mu_{U\alpha} - \mu_{L\alpha})$ of the confidence interval of level $1 - \alpha$ is readily found. Because
$E(s) = \Gamma(n/2)[2/(n-1)]^{1/2}\tau/\Gamma((n-1)/2)$ (Cramer, 1946, p. 383), where $\Gamma$ denotes the gamma
function,

$$E(\mu_{U\alpha} - \mu_{L\alpha}) = 2t_{n-1,\alpha}\,\Gamma(n/2)\,[2/\{n(n-1)\}]^{1/2}\,\tau/\Gamma((n-1)/2).$$

This width is quite close to $2z_\alpha\tau/n^{1/2}$ even for $n$ of moderate size. For example, if $n = 120$ and
$\alpha = 0.05$, the width of $3.9519\tau/n^{1/2}$ is quite close to $2z_\alpha\tau/n^{1/2} = 3.9199\tau/n^{1/2}$.
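The numerical claims in the preceding paragraph can be checked with a short Python sketch that uses only the standard library: `statistics.NormalDist` supplies $z_\alpha$ and `math.lgamma` evaluates the gamma-function ratio on the log scale. The helper names are hypothetical, and the exact value $t_{119,0.05} = 1.9801$ is taken from the text rather than computed.

```python
import math
from statistics import NormalDist


def z_two_sided(alpha):
    """z_alpha such that P(|Z| >= z_alpha) = alpha for a standard normal Z."""
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


def t_approx(nu, alpha):
    """First-order approximation t_{nu,alpha} ~ z + (z^3 + z) / (4 nu)."""
    z = z_two_sided(alpha)
    return z + (z ** 3 + z) / (4.0 * nu)


def expected_width_factor(n, t_mult):
    """E(mu_U - mu_L) in units of tau / sqrt(n), for normal data:
    2 t Gamma(n/2) [2/(n-1)]^{1/2} / Gamma((n-1)/2)."""
    log_ratio = math.lgamma(n / 2.0) - math.lgamma((n - 1) / 2.0)
    return 2.0 * t_mult * math.exp(log_ratio) * math.sqrt(2.0 / (n - 1))


print(round(t_approx(119, 0.05), 4))                 # 1.9799
print(round(expected_width_factor(120, 1.9801), 4))  # 3.9519
print(round(2 * z_two_sided(0.05), 4))               # 3.9199
```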

These familiar results for the sample mean do not apply even for such simple summary
statistics as the sample standard deviation $s$, the square root of $s^2$. The sample standard
deviation is commonly used to estimate the common standard deviation $\tau$ of the observations $X_i$.
Nonetheless, the expectation $E(s)$ of $s$ is not $\tau$, and the variance $\sigma^2(s)$ of $s$ does not have an
unbiased estimate. The delta method can be used to study statistical properties of $s$ when the variance
$v^2$ of $Y_i = [(X_i - \mu)^2 - \tau^2]/(2\tau)$ is finite and positive (Cramer, 1946, p. 353). In this case, as the
sample size $n$ increases, $s$ is well approximated by $\tau + \bar Y$, where $\bar Y$ is the sample mean of the $Y_i$,
$1 \le i \le n$. Let $R = s - \tau - \bar Y$ denote the approximation error. As the sample size $n$ increases, the
mean squared error $E(R^2)$ is sufficiently small that $E(R^2)/\sigma^2(\bar Y)$ approaches 0. The mean $E(s)$
approaches $\tau$ sufficiently rapidly that $[E(s) - \tau]/\sigma^2(\bar Y)$ converges to $-1/(2\tau)$. The variance $\sigma^2(s)$
is well approximated by $\sigma^2(\bar Y)$ in the sense that $[\sigma^2(\bar Y) - \sigma^2(s)]/\sigma^2(s)$ approaches 0 as the sample
size increases. The $Y_i$ are not observed, but one may approximate $Y_i$ by $\hat Y_i = [(X_i - \bar X)^2 - s^2]/(2s)$
and obtain an estimate $\hat\sigma^2(s)$ for $\sigma^2(s)$ equal to the estimated variance of the sample mean for
observations $\hat Y_i$, $1 \le i \le n$. An approximate confidence interval for $\tau$ is based on the observation
that $(s - \tau)/\hat\sigma(s)$ has an approximate standard normal distribution if the sample size $n$ is large.
In addition, the ratio $\hat\sigma^2(s)/\sigma^2(s)$ converges in probability to 1 as the sample size $n$ becomes
large; that is, for any positive real number $\epsilon$, as the sample size $n$ increases, the probability that
$\hat\sigma^2(s)/\sigma^2(s)$ differs from 1 by more than $\epsilon$ approaches 0.
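The delta-method estimate $\hat\sigma(s)$ described above can be sketched in Python as follows, assuming the observations arrive as a plain list; `delta_se_sd` is a hypothetical name. The sketch forms the $\hat Y_i$ and returns the estimated standard error of their sample mean, using the $n - 1$ divisor as for $\hat\sigma^2(\bar X)$.

```python
import math
from statistics import mean, stdev


def delta_se_sd(xs):
    """Return s and the delta-method standard error sigma_hat(s).

    Each unobserved Y_i = [(X_i - mu)^2 - tau^2] / (2 tau) is replaced by
    Y_hat_i = [(X_i - Xbar)^2 - s^2] / (2 s); sigma_hat^2(s) is then the
    estimated variance of the sample mean of the Y_hat_i.
    """
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    xbar = mean(xs)
    s = stdev(xs)
    y_hat = [((x - xbar) ** 2 - s ** 2) / (2.0 * s) for x in xs]
    return s, stdev(y_hat) / math.sqrt(n)


print(delta_se_sd([1.0, 2.0, 3.0, 4.0, 5.0]))
```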

1.1 Weights 

Resampling methods provide an alternative approach to variance estimation. These methods
can be described in terms of sampling weights (Efron, 1982, p. 37). For example, consider the
sample mean $\bar X$. For each observation $i$, let $w_i \ge 0$ be an integer weight assigned to sample
member $i$. The weight $w_i$ will represent the number of times sample member $i$ is to be used in
computation of an estimate. Let $\mathbf w$ denote the $n$-dimensional weight vector with coordinate $i$
equal to $w_i$, and let the sum $n[\mathbf w] = \sum_{i=1}^{n} w_i$ of the weights be positive. Then one may consider
the weighted mean

$$\bar X[\mathbf w] = \{n[\mathbf w]\}^{-1}\sum_{i=1}^{n} w_i X_i.$$

Thus $\bar X$ is $\bar X[\mathbf 1]$, where $\mathbf 1$ is the vector with all coordinates 1. If $w_1 = 0$ and $w_i = 1$ for $i > 1$, then

$$\bar X[\mathbf w] = (n-1)^{-1}\sum_{i=2}^{n} X_i$$

is the sample mean for the observations $X_2$ to $X_n$. In general, $\bar X[\mathbf w]$ has expectation $\mu$, just as in
the case of the original sample mean $\bar X$.

Weights can also be used with the sample variance and sample standard deviation. Let
$n[\mathbf w] > 1$, let

$$s^2[\mathbf w] = \{n[\mathbf w] - 1\}^{-1}\sum_{i=1}^{n} w_i\,(X_i - \bar X[\mathbf w])^2,$$

and let $s[\mathbf w]$ be the square root of $s^2[\mathbf w]$. If all weights $w_i$ are 0 or 1, then $s^2[\mathbf w]$ has expectation
$\tau^2$. Note that $s^2[\mathbf 1]$ is the sample variance $s^2$ of the $X_i$, $1 \le i \le n$, and $s[\mathbf 1]$ is the corresponding
sample standard deviation $s$. If $w_1 = 0$ and $w_i = 1$ for $i > 1$, then $s^2[\mathbf w]$ is the sample variance
for the observations $X_2$ to $X_n$, and $s[\mathbf w]$ is the corresponding sample standard deviation.
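The weighted statistics $\bar X[\mathbf w]$ and $s^2[\mathbf w]$ can be sketched in Python as follows, assuming nonnegative integer weights listed in the same order as the observations; the function names are illustrative, not from the report. An integer weight of 2 behaves exactly like listing the observation twice.

```python
def weighted_mean(xs, w):
    """Xbar[w] = {n[w]}^{-1} sum_i w_i X_i for nonnegative integer weights w."""
    n_w = sum(w)
    if n_w <= 0:
        raise ValueError("the weight sum n[w] must be positive")
    return sum(wi * xi for wi, xi in zip(w, xs)) / n_w


def weighted_variance(xs, w):
    """s^2[w] = {n[w] - 1}^{-1} sum_i w_i (X_i - Xbar[w])^2, requiring n[w] > 1."""
    n_w = sum(w)
    if n_w <= 1:
        raise ValueError("the weight sum n[w] must exceed 1")
    xbar_w = weighted_mean(xs, w)
    return sum(wi * (xi - xbar_w) ** 2 for wi, xi in zip(w, xs)) / (n_w - 1)


xs = [2.0, 4.0, 9.0, 5.0]
print(weighted_mean(xs, [1, 1, 1, 1]))  # 5.0, the ordinary sample mean
print(weighted_mean(xs, [0, 1, 1, 1]))  # 6.0, the mean of X_2 to X_n
```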

In general, estimates $g[\mathbf w]$ will be considered for a real parameter $\gamma$, where $g[\mathbf 1]$ will be denoted
by $g$. For the weight vectors $\mathbf w$ under study, the essential requirements are that $g[\mathbf w]$ have finite
variance and that independent and identically distributed random variables $Y_i$, $1 \le i \le n$, with
mean 0 and variance $v^2 > 0$ exist such that the estimates $g[\mathbf w]$ are well approximated by $\gamma + \bar Y[\mathbf w]$,
where the weighted mean is

$$\bar Y[\mathbf w] = \{n[\mathbf w]\}^{-1}\sum_{i=1}^{n} w_i Y_i$$

(Shao & Wu, 1989). In the case of $\bar X[\mathbf w]$, $Y_i = X_i - \mu$. In the case of $s[\mathbf w]$, the requirement is met
with $Y_i = [(X_i - \mu)^2 - \tau^2]/(2\tau)$. The approximation requirements involve the approximation error

$$R = g - \gamma - \bar Y \qquad (1)$$

for the complete sample and the approximation error

$$R[\mathbf w] = g[\mathbf w] - \gamma - \bar Y[\mathbf w] \qquad (2)$$

for the weight vector $\mathbf w$ with integer weight $w_i \ge 0$ assigned to sample member $i$.

1.2 Delete-1 Jackknifing 

In traditional delete-1 jackknifing (Quenouille, 1956; Shao & Wu, 1989; Tukey, 1958), weight
vectors $\mathbf w(j)$, $1 \le j \le n$, are employed to compute $n$ sample statistics. These weight vectors
correspond to samples in which one member is omitted. Thus the weight vector $\mathbf w(j)$ provides
a weight $w_i(j) = 1$ to each sample member $i$ not equal to $j$, but the weight $w_j(j)$ for sample
member $j$ is 0. For sample member $j$, the delete-1 estimate $g[\mathbf w(j)]$ corresponds to an estimate of
$\gamma$ based on the observed $X_i$ for all sample members $i$ except $j$. For example, $\mathbf w(1)$ has coordinate
$w_1(1)$ equal to 0 and coordinates $w_i(1) = 1$ for $i \ge 2$, so that $g[\mathbf w(1)]$ is the estimate based on the
observations $X_i$, $i > 1$. The average delete-1 estimate is then

$$\bar g = n^{-1}\sum_{j=1}^{n} g[\mathbf w(j)].$$

The jackknife variance estimate for $\sigma^2(g)$ is

$$\hat\sigma_J^2(g) = \frac{n-1}{n}\sum_{j=1}^{n}\left(g[\mathbf w(j)] - \bar g\right)^2.$$
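The delete-1 estimate $\hat\sigma_J^2(g)$ can be sketched as below for any statistic supplied as a Python function; the name `jackknife_variance` is hypothetical. This brute-force version recomputes the statistic $n$ times, which is the computational cost discussed later in this section.

```python
from statistics import mean


def jackknife_variance(xs, stat):
    """Delete-1 jackknife variance estimate of g = stat(xs).

    g[w(j)] is stat applied with observation j left out, and
    sigma_J^2(g) = ((n - 1) / n) * sum_j (g[w(j)] - g_bar)^2.
    """
    xs = list(xs)
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    leave_one_out = [stat(xs[:j] + xs[j + 1:]) for j in range(n)]
    g_bar = mean(leave_one_out)
    return (n - 1) / n * sum((g - g_bar) ** 2 for g in leave_one_out)


data = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
print(jackknife_variance(data, mean))  # agrees with s^2 / n for the sample mean
```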

The delete-1 jackknife has desirable large-sample properties when two conditions both hold.
The first condition is that the mean squared approximation error $E(R^2)$ associated with the
complete sample is sufficiently small so that

$$E(R^2)/\sigma^2(\bar Y) \to 0 \qquad (3)$$

as the sample size $n$ becomes large. The second condition requires that the difference $R - R[\mathbf w(j)]$
between the approximation errors $R$ for the complete sample and $R[\mathbf w(j)]$ for the sample with
member $j$ omitted is sufficiently small so that

$$\max_{1 \le j \le n} E(\{R - R[\mathbf w(j)]\}^2)/[\sigma^2(\bar Y)]^2 \to 0 \qquad (4)$$

as the sample size $n$ increases (Shao & Wu, 1989). Under these conditions, the variance
$\sigma^2(g)$ is well approximated by $\sigma^2(\bar Y) = v^2/n$ in the sense that $\sigma^2(g)/\sigma^2(\bar Y)$ converges to 1 as
the sample size $n$ becomes large. The bias $E(g) - \gamma$ is sufficiently small so that $[E(g) - \gamma]/\sigma(g)$
converges to 0 as the sample size $n$ increases. The approximation $\hat\sigma_J^2(g)$ to the variance
$\sigma^2(g)$ is sufficiently accurate so that $\hat\sigma_J^2(g)/\sigma^2(g)$ converges in probability to 1 as the sample size
$n$ increases, and the ratio $(g - \gamma)/\hat\sigma_J(g)$ has an approximate standard normal distribution.

Approximate confidence intervals for $\gamma$ are readily constructed. For consistency with practice
for the sample mean, for real $\alpha$ such that $0 < \alpha < 1$, let the lower bound of the approximate
confidence interval for $\gamma$ of level $1 - \alpha$ be

$$\gamma_{JL\alpha} = g - t_{n-1,\alpha}\,\hat\sigma_J(g),$$

and let the upper bound be

$$\gamma_{JU\alpha} = g + t_{n-1,\alpha}\,\hat\sigma_J(g).$$

Then the probability that $\gamma_{JL\alpha} < \gamma < \gamma_{JU\alpha}$ approaches $1 - \alpha$ as the sample size $n$ increases.

In the case of the sample mean, (3) and (4) hold trivially if $Y_i = X_i - \mu$, $\gamma = \mu$, $v^2 = \tau^2$, and
$g = \bar X$, for $R$ and $R[\mathbf w(j)]$ are 0. Delete-1 jackknifing leads to conventional inferences concerning
the population mean. Because $\bar X[\mathbf w(j)] - \bar X = (\bar X - X_j)/(n-1)$, the average of the $\bar X[\mathbf w(j)]$,
$1 \le j \le n$, is the original sample mean $\bar X$, and

$$\hat\sigma_J^2(\bar X) = \frac{n-1}{n}\sum_{j=1}^{n}\left(\bar X[\mathbf w(j)] - \bar X\right)^2 = \hat\sigma^2(\bar X)$$

(Efron, 1982, pp. 6, 13). Thus jackknifing simply leads to the conventional estimate of the
variance of the sample mean. In addition, the jackknife confidence bounds $\mu_{JL\alpha}$ and $\mu_{JU\alpha}$ satisfy
$\mu_{JL\alpha} = \mu_{L\alpha}$ and $\mu_{JU\alpha} = \mu_{U\alpha}$.

In the case of the sample standard deviation, (3) and (4) may be shown to hold if $\gamma = \tau$,
$g = s$, and $Y_i = [(X_i - \mu)^2 - \tau^2]/(2\tau)$. Jackknifing yields a different estimate of the variance of $s$
than the one obtained previously by use of the delta method. Let $n > 2$. One has

$$\bar s = n^{-1}\sum_{j=1}^{n} s[\mathbf w(j)]$$

and

$$\hat\sigma_J^2(s) = \frac{n-1}{n}\sum_{j=1}^{n}\left(s[\mathbf w(j)] - \bar s\right)^2.$$

The ratio $\hat\sigma_J^2(s)/\sigma^2(s)$ converges in probability to 1 as the sample size $n$ becomes large,
$(s - \tau)/\hat\sigma_J(s)$ converges in distribution to a random variable with a standard normal distribution,
and, for real $\alpha$ such that $0 < \alpha < 1$, the approximate confidence interval for $\tau$ of level $1 - \alpha$ has
lower bound

$$\tau_{JL\alpha} = s - t_{n-1,\alpha}\,\hat\sigma_J(s)$$

and upper bound

$$\tau_{JU\alpha} = s + t_{n-1,\alpha}\,\hat\sigma_J(s).$$

Computations are somewhat easier than may at first appear to be the case, for

$$s^2[\mathbf w(j)] = (n-2)^{-1}\left[(n-1)s^2 - n(X_j - \bar X)^2/(n-1)\right]$$

(Draper & Smith, 1998, p. 208). As the sample size $n$ increases, the probability approaches $1 - \alpha$
that $\tau_{JL\alpha} < \tau < \tau_{JU\alpha}$.
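As a check on the shortcut formula for $s^2[\mathbf w(j)]$, the following Python sketch computes $\hat\sigma_J^2(s)$ without re-summing any subsample; `jackknife_sd` is an illustrative name. The `max(0.0, ...)` guard only protects against tiny negative values caused by floating-point rounding, since the exact quantity is a sum of squares.

```python
import math
from statistics import mean, variance


def jackknife_sd(xs):
    """Delete-1 jackknife for the sample standard deviation s.

    Uses s^2[w(j)] = (n-2)^{-1} [(n-1) s^2 - n (X_j - Xbar)^2 / (n-1)],
    so no subsample is re-summed. Returns (s, s_bar, sigma_J^2(s)).
    """
    xs = list(xs)
    n = len(xs)
    if n <= 2:
        raise ValueError("need n > 2")
    xbar = mean(xs)
    s2 = variance(xs)
    s_del = [
        math.sqrt(max(0.0, ((n - 1) * s2 - n * (x - xbar) ** 2 / (n - 1)) / (n - 2)))
        for x in xs
    ]
    s_bar = mean(s_del)
    var_j = (n - 1) / n * sum((sj - s_bar) ** 2 for sj in s_del)
    return math.sqrt(s2), s_bar, var_j


data = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
s, s_bar, var_j = jackknife_sd(data)
print(s, s_bar, math.sqrt(var_j))
```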

It is possible to demonstrate that the requirements for delete-1 jackknifing are met for the 
equating applications under study under some possible sampling models; however, computation 
of this jackknife estimate of the variance is impractical in the equating examples considered. 
Thousands of observations are involved, and the computer programs used in calculations do not 
permit any simplification of calculations comparable to that achievable for the sample standard 
deviation. As a consequence, other resampling approaches must be considered. 

1.3 Grouped Jackknifing

The grouped jackknife (Miller, 1964) is a less computationally intensive resampling alternative
to the traditional jackknife. In this approach, the $n$ observations are divided into $k \le n$ disjoint
groups $G_j$, $1 \le j \le k$, with approximately equal numbers of members. In the simplest case, the
sample size $n$ is a multiple of $k$, so that each group $G_j$ can be selected to have $n(G_j) = n/k$
members. For example, if $n = 100$ and $k = 10$, then one might have $G_1$ contain observations 1 to
10, $G_2$ contain observations 11 to 20, and $G_{10}$ contain observations 91 to 100. More generally, the
groups can always be chosen so that $|n(G_j) - n/k|$ is less than 1. For example, if $n$ is 101 and $k$ is
10, then $G_1$ to $G_9$ can be defined as in the case of $k = 10$ and $n = 100$; however, $G_{10}$ may now
be defined so that $G_{10}$ contains observations 91 to 101. The weight vectors $\mathbf w_G(j)$, $1 \le j \le k$, are
selected so that $\mathbf w_G(j)$ has $i$th coordinate $w_{iG}(j)$ equal to 1 if observation $i$ is not in group $G_j$.
Coordinate $w_{iG}(j)$ is 0 if $i$ is in group $G_j$. For example, the delete-$n(G_j)$ sample mean $\bar X[\mathbf w_G(j)]$
is the sample mean of the $X_i$ for observations $i$ not in group $G_j$. With grouped jackknifing, the
variance estimate is

$$\hat\sigma_G^2(g) = \frac{k-1}{k}\sum_{j=1}^{k}\left(g[\mathbf w_G(j)] - \bar g_G\right)^2,$$

where the average of the delete-$n(G_j)$ estimates $g[\mathbf w_G(j)]$, $1 \le j \le k$, is

$$\bar g_G = k^{-1}\sum_{j=1}^{k} g[\mathbf w_G(j)].$$

For $0 < \alpha < 1$, the approximate confidence interval of level $1 - \alpha$ for $\gamma$ has lower bound

$$\gamma_{GL\alpha} = g - t_{k-1,\alpha}\,\hat\sigma_G(g)$$

and upper bound

$$\gamma_{GU\alpha} = g + t_{k-1,\alpha}\,\hat\sigma_G(g).$$
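A minimal sketch of grouped jackknifing for a simple random sample, assuming contiguous groups as in the example above. The helper `contiguous_groups` is one hypothetical way to keep $|n(G_j) - n/k| < 1$: when $n/k$ is not an integer, the extra observations go to the last groups, so $n = 101$ and $k = 10$ reproduce the grouping described in the text. All function names are illustrative, and the caller supplies $t_{k-1,\alpha}$.

```python
import math
from statistics import mean


def contiguous_groups(n, k):
    """Split indices 0..n-1 into k contiguous groups with |size - n/k| < 1."""
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    q, r = divmod(n, k)
    # The last r groups receive one extra observation each.
    sizes = [q + (1 if j >= k - r else 0) for j in range(k)]
    groups, start = [], 0
    for size in sizes:
        groups.append(range(start, start + size))
        start += size
    return groups


def grouped_jackknife_variance(xs, stat, k):
    """sigma_G^2(g) = ((k - 1) / k) * sum_j (g[w_G(j)] - g_bar_G)^2."""
    xs = list(xs)
    if k < 2:
        raise ValueError("need k >= 2")
    estimates = []
    for group in contiguous_groups(len(xs), k):
        kept = [x for i, x in enumerate(xs) if i not in group]  # omit group G_j
        estimates.append(stat(kept))
    g_bar = mean(estimates)
    return (k - 1) / k * sum((g - g_bar) ** 2 for g in estimates)


def grouped_jackknife_interval(xs, stat, k, t_mult):
    """(gamma_GL, gamma_GU) = g -/+ t_{k-1,alpha} * sigma_G(g)."""
    g = stat(list(xs))
    se = math.sqrt(grouped_jackknife_variance(xs, stat, k))
    return g - t_mult * se, g + t_mult * se
```

With $k = n$ every group holds a single observation, so `grouped_jackknife_variance` reduces to the delete-1 estimate, matching the special case noted in the next paragraph.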

Traditional delete-1 jackknifing is obtained in the special case of $k = n$, for each $G_j$ contains
only one member of the sample, and $\mathbf w_G(j) = \mathbf w(j)$. When $k < n$, grouped jackknifing is different
from delete-1 jackknifing even in simple cases such as estimation of the variance of the sample
mean. In the applications under study, the number $k$ is fixed by restrictions on computational
resources. For example, in the equating problems under study, $k$ is 120 no matter how large the
sample size $n$ may be. This restriction greatly reduces computational labor relative to alternatives.
Delete-1 jackknifing requires $n$ subsamples. Delete-$d$ jackknifing, $1 < d < n - 1$, requires all
subsamples in which $d$ members are omitted from the original sample (Shao & Wu, 1989), so
that even more subsamples are required than for delete-1 jackknifing. For the applications under
study, no obvious gain is achieved from use of $k = 120$ bootstrap samples rather than the grouped
jackknife.

The behavior of grouped jackknifing is easiest to examine if $n/k$ is an integer and if $g$ is the
sample mean $\bar X$. In this case, results can be regarded as quite satisfactory, although there is some
loss in terms of width of confidence intervals if $k < n$. Nonetheless, this loss is small for $k$ of
moderate size. To verify this claim, let $\mathbf v_G(j) = \mathbf 1 - \mathbf w_G(j)$, so that $\mathbf v_G(j)$ has coordinate $v_{iG}(j) = 1$
if sample member $i$ is in $G_j$ and $v_{iG}(j) = 0$ if sample member $i$ is not in $G_j$. Thus $\bar X[\mathbf w_G(j)]$ is the
average of the $X_i$ for sample members $i$ not in group $G_j$, and $\bar X[\mathbf v_G(j)]$ is the average of the $X_i$ for
sample members $i$ in group $G_j$. Because each group $G_j$ has $n/k$ members, $n[\mathbf w_G(j)] = n(k-1)/k$,
$n[\mathbf v_G(j)] = n/k$, and

$$(k-1)\bar X[\mathbf w_G(j)] + \bar X[\mathbf v_G(j)] = k\bar X.$$

The sample mean $\bar X$ is both the average of the delete-$n/k$ sample means $\bar X[\mathbf w_G(j)]$, $1 \le j \le k$,
and the average of the sample means $\bar X[\mathbf v_G(j)]$ for sample members in group $G_j$, $1 \le j \le k$. As in
the traditional jackknife, it is readily checked that

$$\hat\sigma_G^2(\bar X) = [k(k-1)]^{-1}\sum_{j=1}^{k}\left(\bar X[\mathbf v_G(j)] - \bar X\right)^2.$$
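The identity for $\hat\sigma_G^2(\bar X)$ can be checked numerically with the Python sketch below, assuming $n/k$ is an integer and contiguous groups; both function names are hypothetical. The first function uses the within-group means $\bar X[\mathbf v_G(j)]$, the second the delete-group means $\bar X[\mathbf w_G(j)]$ in the general formula for $\hat\sigma_G^2(g)$.

```python
from statistics import mean


def _group_size(n, k):
    if n % k:
        raise ValueError("this sketch assumes n/k is an integer")
    return n // k


def grouped_var_from_group_means(xs, k):
    """[k(k-1)]^{-1} sum_j (Xbar[v_G(j)] - Xbar)^2."""
    m = _group_size(len(xs), k)
    xbar = mean(xs)
    group_means = [mean(xs[j * m:(j + 1) * m]) for j in range(k)]
    return sum((gm - xbar) ** 2 for gm in group_means) / (k * (k - 1))


def grouped_var_from_deleted_means(xs, k):
    """((k-1)/k) sum_j (Xbar[w_G(j)] - average of the Xbar[w_G(j)])^2."""
    m = _group_size(len(xs), k)
    deleted = [mean(xs[:j * m] + xs[(j + 1) * m:]) for j in range(k)]
    d_bar = mean(deleted)
    return (k - 1) / k * sum((d - d_bar) ** 2 for d in deleted)


data = [float((7 * i) % 11) for i in range(30)]
for k in (3, 5, 10):
    print(k, grouped_var_from_group_means(data, k), grouped_var_from_deleted_means(data, k))
```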

Because the $G_j$ are disjoint groups and the $X_i$ are independent and identically distributed, the
sample means $\bar X[\mathbf v_G(j)]$, $1 \le j \le k$, are independent and identically distributed with common
mean $\mu$ and common variance $\tau^2/(n/k)$. Thus $\hat\sigma_G^2(\bar X)$ has mean $\sigma^2(\bar X)$, so that $\hat\sigma_G^2(\bar X)$ is an
unbiased estimate of the variance of $\bar X$. For $0 < \alpha < 1$, the approximate confidence interval for $\mu$
of level $1 - \alpha$ has lower bound

$$\mu_{GL\alpha} = \bar X - t_{k-1,\alpha}\,\hat\sigma_G(\bar X)$$

and upper bound

$$\mu_{GU\alpha} = \bar X + t_{k-1,\alpha}\,\hat\sigma_G(\bar X).$$

If the $X_i$ are normally distributed, then the $\bar X[\mathbf v_G(j)]$ are also normally distributed, so that
$(k-1)\hat\sigma_G^2(\bar X)/\sigma^2(\bar X)$ has a chi-square distribution on $k - 1$ degrees of freedom and $(\bar X - \mu)/\hat\sigma_G(\bar X)$
has a $t$ distribution on $k - 1$ degrees of freedom. This exact result is not available if bootstrapping
is used. For $0 < \alpha < 1$, $1 - \alpha$ is the probability that $\mu_{GL\alpha} < \mu < \mu_{GU\alpha}$. Results are quite different
from the traditional jackknife to the extent that $\hat\sigma_G^2(\bar X)/\sigma^2(\bar X)$ does not converge in probability
to 1 as the sample size becomes large (Shao & Wu, 1989). Because a chi-square random variable
with $k - 1$ degrees of freedom has mean $k - 1$ and variance $2(k-1)$, the ratio $\hat\sigma_G^2(\bar X)/\sigma^2(\bar X)$ has
mean 1, variance $2/(k-1)$, and standard deviation $[2/(k-1)]^{1/2}$ for all sample sizes. In the case
of $k = 120$ considered in this report, the standard deviation $[2/119]^{1/2} = 0.13$ is certainly not
negligible, so that variability of $\hat\sigma_G^2(\bar X)$ cannot be ignored. Despite the variability of $\hat\sigma_G^2(\bar X)$, the
impact on confidence intervals for $\mu$ is relatively small if $\hat\sigma_G(\bar X)$ is used instead of $\hat\sigma(\bar X)$. Recall
that in traditional jackknifing, the expected width of the confidence interval at level $1 - \alpha$ is

$$E(\mu_{JU\alpha} - \mu_{JL\alpha}) = E(\mu_{U\alpha} - \mu_{L\alpha}) = 2t_{n-1,\alpha}\,\Gamma(n/2)\,[2/\{n(n-1)\}]^{1/2}\,\tau/\Gamma((n-1)/2).$$

A very similar argument may be employed to show that the expected width of the confidence
interval at level $1 - \alpha$ from grouped jackknifing is

$$E(\mu_{GU\alpha} - \mu_{GL\alpha}) = 2t_{k-1,\alpha}\,\Gamma(k/2)\,[2/\{n(k-1)\}]^{1/2}\,\tau/\Gamma((k-1)/2).$$

As the sample size becomes large, the ratio

$$\frac{E(\mu_{GU\alpha} - \mu_{GL\alpha})}{E(\mu_{JU\alpha} - \mu_{JL\alpha})} = \frac{t_{k-1,\alpha}\,\Gamma(k/2)\,(n-1)^{1/2}\,\Gamma((n-1)/2)}{t_{n-1,\alpha}\,\Gamma(n/2)\,(k-1)^{1/2}\,\Gamma((k-1)/2)}$$

approaches

$$\frac{2^{1/2}\,t_{k-1,\alpha}\,\Gamma(k/2)}{z_\alpha\,(k-1)^{1/2}\,\Gamma((k-1)/2)}.$$

For $k = 120$ and $\alpha = 0.05$, this ratio is 1.0082, a value only slightly greater than 1.
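The limiting width ratio can be evaluated with the standard library alone, as in this Python sketch; `width_ratio_limit` is a hypothetical name, and the caller supplies $t_{k-1,\alpha}$ (here the value $t_{119,0.05} = 1.9801$ quoted earlier). Working with `math.lgamma` avoids overflow of the gamma function for large $k$.

```python
import math
from statistics import NormalDist


def width_ratio_limit(k, t_k, alpha):
    """Large-n limit of E(grouped interval width) / E(delete-1 interval width).

    Evaluates 2^{1/2} t_{k-1,alpha} Gamma(k/2) / [z_alpha (k-1)^{1/2} Gamma((k-1)/2)],
    with t_k = t_{k-1,alpha} supplied by the caller.
    """
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    log_gamma_ratio = math.lgamma(k / 2.0) - math.lgamma((k - 1) / 2.0)
    return math.sqrt(2.0) * t_k * math.exp(log_gamma_ratio) / (z * math.sqrt(k - 1))


print(round(width_ratio_limit(120, 1.9801, 0.05), 4))  # 1.0082
```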

To obtain more general results concerning grouped jackknifing for the sample mean, apply
the central limit theorem and the Mann-Wald theorem (Rao, 1973, p. 124). It follows that, as
the sample size $n$ becomes large, $(k-1)\hat\sigma_G^2(\bar X)/\sigma^2(\bar X)$ has an approximate
chi-square distribution on $k - 1$ degrees of freedom, and $(\bar X - \mu)/\hat\sigma_G(\bar X)$ has an approximate $t$
distribution on $k - 1$ degrees of freedom. Thus, even for large samples, it remains the case that
$\hat\sigma_G^2(\bar X)$ has limited accuracy as an estimate of $\sigma^2(\bar X)$. Nonetheless, as the sample size $n$ becomes
large, the probability approaches $1 - \alpha$ that $\mu_{GL\alpha} < \mu < \mu_{GU\alpha}$. As long as $\tau + (X_i - \mu)^2/(2\tau)$ has
finite variance, it remains the case that the ratio

$$\frac{E(\mu_{GU\alpha} - \mu_{GL\alpha})}{E(\mu_{JU\alpha} - \mu_{JL\alpha})}$$

approaches

$$\frac{2^{1/2}\,t_{k-1,\alpha}\,\Gamma(k/2)}{z_\alpha\,(k-1)^{1/2}\,\Gamma((k-1)/2)}$$

as the sample size increases.

The basic results for the sample mean extend readily to more general estimates $g$. Define the
$Y_i$ as in the case of the traditional jackknife so that $g[\mathbf w]$ is approximated by $\gamma + \bar Y[\mathbf w]$, the $Y_i$ are
independent and identically distributed, and the $Y_i$ have common mean 0 and common variance
$v^2 > 0$. Conditions for large-sample approximations are a bit weaker than in the traditional
jackknife (Shao & Wu, 1989). It suffices to have (3) hold and to have

$$\max_{1 \le j \le k} E(\{R - R[\mathbf w_G(j)]\}^2)/\sigma^2(\bar Y) \to 0 \qquad (5)$$

hold as the sample size $n$ increases (Shao & Wu, 1989). Under these conditions, it remains true
that the variance $\sigma^2(g)$ is well approximated by the variance $\sigma^2(\bar Y)$ in the sense that $\sigma^2(g)/\sigma^2(\bar Y)$
converges to 1 as the sample size $n$ becomes large. In addition, the bias $E(g) - \gamma$ is sufficiently
small that $[E(g) - \gamma]/\sigma(g)$ converges to 0 as $n$ becomes large. As in the case of the grouped
jackknife of the sample mean, the ratio $\hat\sigma_G^2(g)/\sigma^2(g)$ does not converge in probability to 1 as
the sample size increases. Instead, $(k-1)\hat\sigma_G^2(g)/\sigma^2(g)$ has an approximate
chi-square distribution on $k - 1$ degrees of freedom, and $(g - \gamma)/\hat\sigma_G(g)$ has an approximate $t$
distribution on $k - 1$ degrees of freedom. It follows that, as the sample size $n$ increases, the
probability approaches $1 - \alpha$ that $\gamma_{GL\alpha} < \gamma < \gamma_{GU\alpha}$.

Grouped jackknifing can be used for some equating designs to evaluate equating error; 
however, in the cases under study in this report, it is probably appropriate to consider an 
adaptation of grouped jackknifing to stratified random sampling. 


1.4 Grouped Jackknifing for Stratified Random Samples 

Jackknifing is often applied when sampling is much more complex than in the case of simple
random sampling (Wolter, 1985). A variety of possible approaches exist. In the analysis of
equating under study, grouped jackknifing is applied to statistics computed from data from two
independent stratified random samples. Similar studies can be performed on data from several
independent random samples. To illustrate the approach, consider the case of $H \ge 2$ populations.
For each population $h$, consider $n_h \ge 2$ observations $X_{ih}$, $1 \le i \le n_h$, derived by simple random
sampling with replacement. A basic requirement for grouped jackknifing for the stratified case
is that it works in a satisfactory manner when a linear combination of sample means is used to
estimate a corresponding linear combination of population means. As in the case of grouped
jackknifing or delete-1 jackknifing for simple random sampling, further use of jackknifing can then
be justified by consideration of parameter estimates well approximated by linear combinations of
sample means.

To examine the estimation problem for sample means, let the independent random variables
$X_{ih}$ have mean $\mu_h$ and variance $\tau_h^2 > 0$ for $1 \le i \le n_h$ and $1 \le h \le H$. Consider estimation of a
linear combination

$$\gamma = \sum_{h=1}^{H} c_h\mu_h \qquad (6)$$

of the means $\mu_h$, $1 \le h \le H$, for some real numbers $c_h$, $1 \le h \le H$. For example, if $c_h = H^{-1}$ for
each population $h$, then $\gamma$ is the average $\bar\mu$ of the population means $\mu_h$. The conventional estimate
of $\gamma$ is the linear combination

$$g = \sum_{h=1}^{H} c_h\bar X_h \qquad (7)$$

of the sample means

$$\bar X_h = n_h^{-1}\sum_{i=1}^{n_h} X_{ih}.$$

The mean of $g$ is $\gamma$, so that $g$ is unbiased, and the variance of $g$ is

$$\sigma^2(g) = \sum_{h=1}^{H} c_h^2\tau_h^2/n_h.$$

One may estimate the variance $\sigma^2(g)$ of $g$ by

$$\hat\sigma^2(g) = \sum_{h=1}^{H} c_h^2 s_h^2/n_h,$$

where the sample variance of the $X_{ih}$, $1 \le i \le n_h$, is

$$s_h^2 = (n_h - 1)^{-1}\sum_{i=1}^{n_h}(X_{ih} - \bar X_h)^2.$$

The ratio $\hat\sigma^2(g)/\sigma^2(g)$ converges in probability to 1 if each sample size $n_h$ becomes large, and
$(g - \gamma)/\hat\sigma(g)$ has an approximate standard normal distribution if all $n_h$ are large.
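For the linear combination (7), the conventional estimate and its variance estimate take only a few lines. This Python sketch assumes each stratum is given as a list of at least two observations, in the same order as the coefficients $c_h$; `stratified_estimate` is an illustrative name.

```python
from statistics import mean, variance


def stratified_estimate(samples, c):
    """Return g = sum_h c_h Xbar_h and sigma_hat^2(g) = sum_h c_h^2 s_h^2 / n_h.

    samples[h] holds the n_h >= 2 observations drawn from population h,
    and c[h] is the coefficient c_h.
    """
    if len(samples) != len(c):
        raise ValueError("need one coefficient per sample")
    g = sum(ch * mean(xh) for ch, xh in zip(c, samples))
    var_hat = sum(ch ** 2 * variance(xh) / len(xh) for ch, xh in zip(c, samples))
    return g, var_hat


# Two populations with equal coefficients c_h = 1/H, H = 2.
print(stratified_estimate([[1.0, 2.0, 3.0], [10.0, 14.0]], [0.5, 0.5]))
```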

As in simple random samples, weights can be applied to stratified random samples. Let $\mathbf w$
have nonnegative integer coordinates $w_{ih}$, $1 \le i \le n_h$, $1 \le h \le H$, let $n_h[\mathbf w] = \sum_{i=1}^{n_h} w_{ih} > 0$ be
the weight sum for sample $h$, and let

$$\bar X_h[\mathbf w] = \{n_h[\mathbf w]\}^{-1}\sum_{i=1}^{n_h} w_{ih}X_{ih}$$

be the weighted sample mean for sample $h$. Corresponding to the linear combination $g$ of (7) is
the linear combination

$$g[\mathbf w] = \sum_{h=1}^{H} c_h\bar X_h[\mathbf w]. \qquad (8)$$

Then $g[\mathbf w]$ has mean $\gamma$. Similarly, if $n_h[\mathbf w] > 1$, then one may let

$$s_h^2[\mathbf w] = \{n_h[\mathbf w] - 1\}^{-1}\sum_{i=1}^{n_h} w_{ih}\{X_{ih} - \bar X_h[\mathbf w]\}^2,$$

so that if each $w_{ih}$ is 0 or 1, then $s_h^2[\mathbf w]$ is the sample variance of the observations $X_{ih}$ for which
$w_{ih} = 1$. If the $w_{ih}$ are all 0 or 1, then $s_h^2[\mathbf w]$ has expectation $\tau_h^2$. If each $w_{ih}$ is 1, then $\bar X_h[\mathbf w] = \bar X_h$,
$s_h^2[\mathbf w] = s_h^2$, and $g[\mathbf w]$ is $g$.

In the version of grouped jackknifing considered here, for a given positive integer $k$ no greater
than the minimum of the sample sizes $n_h$, $1 \le h \le H$, the sample members drawn from population
$h$ are divided into $k$ groups $G_{jh}$, $1 \le j \le k$, of approximately equal size. If $n_h/k$ is an integer,
then each group $G_{jh}$ contains $n(G_{jh}) = n_h/k$ observations. In general, $|n(G_{jh}) - n_h/k|$ is less
than 1. For example, consider $H = 2$, and let $k = 10$ groups be formed from the $n_1 = 100$ members
of the first sample and the $n_2 = 200$ members of the second sample. In this case, $G_{1,1}$ might be
observations 1 to 10 from the first sample, and $G_{1,2}$ might be observations 1 to 20 from the second
sample. One might have $G_{9,1}$ equal to observations 81 to 90 in the first sample and $G_{9,2}$ equal to
observations 161 to 180 from the second sample. In grouped jackknifing, weight vectors $\mathbf w_{GS}(j)$,
$1 \le j \le k$, are considered such that $\mathbf w_{GS}(j)$ has coordinates $w_{ihGS}(j)$, $1 \le i \le n_h$, $1 \le h \le H$, such
that $w_{ihGS}(j)$ is 1 if $i$ is not in $G_{jh}$ and 0 if $i$ is in $G_{jh}$. For example, $\bar X_1[\mathbf w_{GS}(1)]$ is the average
of the observations $X_{i1}$ for sample members $i$ from the first sample that are not in group $G_{1,1}$.

In applications, the standard estimate of a real parameter $\gamma$ is $g = g[\mathbf 1_{GS}]$, where $\mathbf 1_{GS}$ has all
coordinates $1_{ihGS} = 1$. In addition, the estimates $g[\mathbf w_{GS}(j)]$ are used for variance estimation. It is
assumed that $g$ and $g[\mathbf w_{GS}(j)]$ have finite variances. At this point, calculations are essentially the
same as for the grouped jackknife for simple random sampling with replacement. The variance
$\sigma^2(g)$ is estimated by

$$\hat\sigma_{GS}^2(g) = \frac{k-1}{k}\sum_{j=1}^{k}\left(g[\mathbf w_{GS}(j)] - \bar g_{GS}\right)^2,$$

where

$$\bar g_{GS} = k^{-1}\sum_{j=1}^{k} g[\mathbf w_{GS}(j)].$$

For $0 < \alpha < 1$, one has an approximate confidence interval for $\gamma$ of level $1 - \alpha$ with lower bound

$$\gamma_{GSL\alpha} = g - t_{k-1,\alpha}\,\hat\sigma_{GS}(g)$$

and upper bound

$$\gamma_{GSU\alpha} = g + t_{k-1,\alpha}\,\hat\sigma_{GS}(g).$$

Even in the elementary case of $g$ defined as in (7) and $\gamma$ defined by (6), the variance estimate
$\hat\sigma_{GS}^2(g)$ is not the same as $\hat\sigma^2(g)$. Nonetheless, it is not difficult to verify that $\hat\sigma_{GS}^2(g)$ has
expectation $\sigma^2(g)$ if the $n_h/k$ are integers for each sample from population $h$. In addition, if the
$X_{ih}$ are normally distributed, then $(k-1)\hat\sigma_{GS}^2(g)/\sigma^2(g)$ has a chi-squared distribution on $k - 1$
degrees of freedom, and $(g - \gamma)/\hat\sigma_{GS}(g)$ has a $t$ distribution on $k - 1$ degrees of freedom. Thus the
probability is exactly $1 - \alpha$ that $\gamma_{GSL\alpha} < \gamma < \gamma_{GSU\alpha}$.
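A sketch of $\hat\sigma_{GS}^2(g)$ for the linear estimate (7) follows, assuming each $n_h$ is a multiple of $k$ and that group $G_{jh}$ is the $j$th contiguous block of sample $h$, as in the example above; the function name is hypothetical. With equal group sizes, the result agrees with $[k(k-1)]^{-1}\sum_j (g_j - \bar g)^2$, where $g_j$ is the linear combination (7) computed from the groups $G_{jh}$ alone and $\bar g$ is the average of the $g_j$; the `<test>` below checks this numerically.

```python
from statistics import mean


def stratified_grouped_jackknife(samples, c, k):
    """Grouped jackknife for g = sum_h c_h Xbar_h with k groups per sample.

    Group G_jh is the j-th contiguous block of sample h; each n_h is assumed
    to be a multiple of k. Returns (g, sigma_GS^2(g)).
    """
    if k < 2:
        raise ValueError("need k >= 2")
    sizes = []
    for xh in samples:
        if len(xh) % k:
            raise ValueError("this sketch assumes each n_h is a multiple of k")
        sizes.append(len(xh) // k)

    def g_of(subsamples):
        # Linear combination (7) applied to (possibly reduced) samples.
        return sum(ch * mean(xh) for ch, xh in zip(c, subsamples))

    estimates = []
    for j in range(k):
        # Omit group G_jh from every sample h at once.
        kept = [list(xh[:j * m]) + list(xh[(j + 1) * m:]) for xh, m in zip(samples, sizes)]
        estimates.append(g_of(kept))
    g_bar = mean(estimates)
    var_gs = (k - 1) / k * sum((e - g_bar) ** 2 for e in estimates)
    return g_of(samples), var_gs


samples = [[float((3 * i) % 7) for i in range(20)], [float((5 * i) % 11) for i in range(40)]]
c = [1.0, -1.0]
print(stratified_grouped_jackknife(samples, c, 5))
```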

In more complex applications under study, large-sample approximations are required similar to those for grouped jackknifing with simple random sampling with replacement. In addition, computational constraints require that $k$ not increase even if the sample sizes $n_h$ become large. In this case, grouped jackknifing applies when independent random variables $Y_{ih}$, $1 \le i \le n_h$, $1 \le h \le H$, are available such that, for each population $h$, the $Y_{ih}$ are identically distributed with common mean 0 and common finite variance $\tau_h^2 > 0$. Let

$$\bar Y_h = n_h^{-1}\sum_{i=1}^{n_h} Y_{ih},$$

$$\bar Y_h[w] = \{n_h[w]\}^{-1}\sum_{i=1}^{n_h} w_{ih}Y_{ih},$$

and

$$f = \sum_{h=1}^{H}\bar Y_h, \qquad f[w] = \sum_{h=1}^{H}\bar Y_h[w].$$

The $Y_{ih}$ are selected so that $g$ is well approximated by $\gamma + f$ and $g[w]$ is well approximated by $\gamma + f[w]$. The approximation errors

$$R_{GS} = g - \gamma - f \qquad (9)$$

and

$$R_{GS}[w_{GS}(j)] = g[w_{GS}(j)] - \gamma - f[w_{GS}(j)] \qquad (10)$$

must be small for large sample sizes $n_h$, $1 \le h \le H$. To be more specific, it is assumed that

$$E(R_{GS}^2)/\sigma^2(f) \to 0 \qquad (11)$$

and

$$\max_{1 \le j \le k} E\left(\{R_{GS} - R_{GS}[w_{GS}(j)]\}^2\right)/\sigma^2(f) \to 0 \qquad (12)$$

as the sample sizes $n_h$ increase for all populations $h$. For $g$ defined by (7) and $\gamma$ defined by (6), these conditions hold trivially for $Y_{ih} = c_h(X_{ih} - \mu_h)$ and $\tau_h^2 = c_h^2\sigma_h^2$. Under these conditions, the variance ratio $\sigma^2(g)/\sigma^2(f)$ converges to 1 as the sample sizes $n_h$ all become large, and the bias $E(g) - \gamma$ is sufficiently small that $[E(g) - \gamma]/\sigma(g)$ converges to 0 as the sample sizes $n_h$ increase. In addition, for large sample sizes $n_h$, $(k-1)\hat\sigma^2_{GS}(g)/\sigma^2(g)$ has an approximate chi-square distribution on $k - 1$ degrees of freedom, and $(g - \gamma)/\hat\sigma_{GS}(g)$ has an approximate $t$ distribution on $k - 1$ degrees of freedom. Thus, as the $n_h$ all become large, the probability approaches $1 - \alpha$ that $\hat\gamma_{GSL\alpha} \le \gamma \le \hat\gamma_{GSU\alpha}$.
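The large-sample claim can be checked by simulation. The sketch below is a hypothetical example of our own, not from the report: it takes the nonlinear estimate $g = \bar X_1/\bar X_2$ of $\gamma = \mu_1/\mu_2$ with normal data, $n_1 = 100$, $n_2 = 200$, and $k = 10$, and estimates how often the grouped-jackknife interval with $t = 2.262$ (the upper 0.025 point on 9 degrees of freedom) covers $\gamma$. Coverage near 0.95 is expected, because $g$ is well approximated by $\gamma + f$ with $Y_{ih}$ proportional to $X_{ih} - \mu_h$.

```python
import numpy as np

def ratio_interval_coverage(n1=100, n2=200, k=10, reps=2000,
                            t_quantile=2.262, seed=1):
    """Monte Carlo coverage of g -/+ t * sigma_hat_GS(g) for gamma = mu1 / mu2.

    Hypothetical setup: X_i1 ~ N(5, 1), X_i2 ~ N(10, 2^2), g = Xbar_1 / Xbar_2,
    and each sample is split into k contiguous groups of n_h / k observations.
    """
    rng = np.random.default_rng(seed)
    mu1, mu2 = 5.0, 10.0
    gamma = mu1 / mu2
    x1 = rng.normal(mu1, 1.0, size=(reps, n1))
    x2 = rng.normal(mu2, 2.0, size=(reps, n2))
    # Group sums, shape (reps, k): group j holds n_h / k consecutive cases.
    s1 = x1.reshape(reps, k, n1 // k).sum(axis=2)
    s2 = x2.reshape(reps, k, n2 // k).sum(axis=2)
    g = x1.mean(axis=1) / x2.mean(axis=1)
    # Means with group j deleted from both samples, for every replication.
    m1 = (s1.sum(axis=1, keepdims=True) - s1) / (n1 - n1 // k)
    m2 = (s2.sum(axis=1, keepdims=True) - s2) / (n2 - n2 // k)
    r = m1 / m2                                   # g[w_GS(j)]
    dev = r - r.mean(axis=1, keepdims=True)
    var = (k - 1) / k * (dev ** 2).sum(axis=1)    # sigma_hat^2_GS(g)
    covered = np.abs(g - gamma) <= t_quantile * np.sqrt(var)
    return covered.mean()
```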
In section 2, sample $h$ corresponds to examinees in Administration $h$ of an educational test. In the specific example presented, only $H = 2$ administrations are examined. Equating is studied, so the parameter $\gamma$ in the application under study may be the score on Administration 1 of the test that may be regarded as equivalent to a specified score $s$ on Administration 2. The parameter $\gamma$ can be regarded as the equating result that would be obtained were the population distribution of examinee responses known for each administration.

1.5 Jackknifing Comparisons 

In applications in this report to linking of forms, a major issue involves comparison of different linking functions based on different sets of anchor items. The basic analysis is readily accomplished given the sampling procedure and grouping procedure in section 1.4. For some integer $M > 1$, consider $M$ different estimates $g_m$, $1 \le m \le M$, of the respective parameters $\gamma_m$. In typical applications in this report, $m$ corresponds to a particular set of anchor items that might be employed in equating, and $g_m$ is the estimate for anchor set $m$ of a specific equating result $\gamma_m$ that would be obtained were data available on all population members. For example, for a specific raw score point, there may be $M$ different raw-to-raw conversions $g_m$, $1 \le m \le M$, from one form to another that have been produced from equating with $M$ different sets of anchor items. Of interest here is the variability of the parameters $\gamma_m$ for $1 \le m \le M$. It is assumed that the $g_m$ have finite variances. Let the average of the $\gamma_m$, $1 \le m \le M$, be

$$\gamma_{\cdot} = M^{-1}\sum_{m=1}^{M}\gamma_m. \qquad (13)$$

One simple measure of the variability of the parameters $\gamma_m$, $1 \le m \le M$, is their sample variance

$$\sigma_\gamma^2 = (M-1)^{-1}\sum_{m=1}^{M}(\gamma_m - \gamma_{\cdot})^2. \qquad (14)$$

One may estimate the average parameter value $\gamma_{\cdot}$ by the corresponding average estimate

$$g_{\cdot} = M^{-1}\sum_{m=1}^{M} g_m, \qquad (15)$$

and the sample variance $\sigma_\gamma^2$ of the parameters $\gamma_m$, $1 \le m \le M$, may be estimated by the corresponding sample variance

$$\hat\sigma_\gamma^2 = (M-1)^{-1}\sum_{m=1}^{M}(g_m - g_{\cdot})^2 \qquad (16)$$

of the estimates $g_m$, $1 \le m \le M$. The average $g_{\cdot}$ has expectation

$$E(g_{\cdot}) = M^{-1}\sum_{m=1}^{M}E(g_m). \qquad (17)$$

If one recalls that $E(Y^2) = [E(Y)]^2 + \sigma^2(Y)$ for a random variable $Y$ with a finite variance, then one finds that the expectation of the sample variance $\hat\sigma_\gamma^2$ is

$$E(\hat\sigma_\gamma^2) = (M-1)^{-1}\sum_{m=1}^{M}[E(g_m) - E(g_{\cdot})]^2 + (M-1)^{-1}\sum_{m=1}^{M}\sigma^2(g_m - g_{\cdot}). \qquad (18)$$

In typical cases, the variance estimate $\hat\sigma_\gamma^2$ has a positive bias as an estimate of the sample variance $\sigma_\gamma^2$ of the $\gamma_m$. This condition is readily observed in the elementary case in which one has independent $M$-dimensional random vectors $\mathbf{X}_{ih}$ with mean $\boldsymbol\mu_h$ and positive-definite covariance matrix $\mathbf{C}_h$ for $1 \le i \le n_h$ and $1 \le h \le H$. Let coordinate $m$ of $\mathbf{X}_{ih}$ be $X_{mih}$, and let coordinate $m$ of $\boldsymbol\mu_h$ be $\mu_{mh}$. For $1 \le m \le M$, consider estimation of a linear combination

$$\gamma_m = \sum_{h=1}^{H} c_{mh}\mu_{mh} \qquad (19)$$

of the means $\mu_{mh}$, $1 \le h \le H$, for some real numbers $c_{mh}$, $1 \le h \le H$. For example, if $c_{mh} = H^{-1}$ for each population $h$, then $\gamma_m$ is the average $\mu_{m\cdot}$ of the population means $\mu_{mh}$. The conventional estimate of $\gamma_m$ is the linear combination

$$g_m = \sum_{h=1}^{H} c_{mh}\bar X_{mh} \qquad (20)$$

of the sample means

$$\bar X_{mh} = n_h^{-1}\sum_{i=1}^{n_h} X_{mih}.$$

The mean of $g_m$ is $\gamma_m$, so that $g_m$ is unbiased, and the mean of $g_{\cdot}$ is $\gamma_{\cdot}$. For a vector $\mathbf{b}$, let $\mathbf{b}'$ denote its transpose. Then, by (18), the expectation of $\hat\sigma_\gamma^2$ is

$$E(\hat\sigma_\gamma^2) = \sigma_\gamma^2 + (M-1)^{-1}\sum_{m=1}^{M}\sigma^2(g_m - g_{\cdot}),$$

where the variance of $g_m - g_{\cdot}$ is

$$\sigma^2(g_m - g_{\cdot}) = \sum_{h=1}^{H}\mathbf{b}_{mh}'\mathbf{C}_h\mathbf{b}_{mh}/n_h$$

and $\mathbf{b}_{mh}$ is the $M$-dimensional vector with coordinates $b_{m'mh}$, $1 \le m' \le M$, such that $b_{m'mh}$ is $c_{mh}(M-1)/M$ for $m' = m$ and $b_{m'mh}$ is $-c_{m'h}/M$ for $m' \ne m$. Since each $\mathbf{C}_h$ is positive definite, the second term is positive unless all the $\mathbf{b}_{mh}$ are zero.
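The bias term in the elementary case can be evaluated directly. In this sketch (the function name `elementary_bias` is hypothetical), `c` holds the coefficients $c_{mh}$, `covs` the matrices $\mathbf{C}_h$, and `n` the sample sizes; the vectors $\mathbf{b}_{mh}$ are built as described above.

```python
import numpy as np

def elementary_bias(c, covs, n):
    """(M-1)^{-1} sum_m sigma^2(g_m - g_.) for g_m = sum_h c[m, h] * Xbar_mh.

    c    : (M, H) array of coefficients c_mh
    covs : sequence of H covariance matrices C_h, each M x M
    n    : sequence of H sample sizes n_h
    """
    c = np.asarray(c, dtype=float)
    M, H = c.shape
    total = 0.0
    for m in range(M):
        for h in range(H):
            # b_mh: coordinate -c_m'h / M for m' != m, c_mh (M - 1) / M for m' = m
            b = -c[:, h] / M
            b[m] = c[m, h] * (M - 1) / M
            total += b @ np.asarray(covs[h], dtype=float) @ b / n[h]
    return total / (M - 1)
```

For a single population with $\mathbf{C}_1 = \sigma^2\mathbf{I}$ and all $c_{m1} = 1$, the bias reduces to $\sigma^2/n_1$; with $\mathbf{C}_1$ diagonal, it is the average diagonal element divided by $n_1$.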

To investigate this bias in estimation of the sample variance $\sigma_\gamma^2$ of the parameters $\gamma_m$, $1 \le m \le M$, grouped jackknifing for stratified random sampling may be employed. The basic requirement is that each $g_m$ satisfy the requirements for grouped jackknifing described in section 1.4 for the case of stratified random sampling. Consider the following conditions. Let $\mathbf{0}$ be the $M$-dimensional vector with all coordinates 0. Let $\mathbf{Y}_{ih}$, $1 \le i \le n_h$, $1 \le h \le H$, be independent $M$-dimensional vectors with coordinates $Y_{mih}$, $1 \le m \le M$, such that, for sample $h$, the $\mathbf{Y}_{ih}$ are identically distributed with common mean $\mathbf{0}$ and common positive-definite covariance matrix $\boldsymbol\Gamma_h$. Let the element in row $m$ and column $m'$ of $\boldsymbol\Gamma_h$ be $\Gamma_{mm'h}$. Let

$$\bar Y_{mh} = n_h^{-1}\sum_{i=1}^{n_h} Y_{mih}, \qquad (21)$$

let the average of the $Y_{mih}$ over $m$ be

$$Y_{\cdot ih} = M^{-1}\sum_{m=1}^{M} Y_{mih}, \qquad (22)$$

and let

$$\bar Y_{\cdot h} = n_h^{-1}\sum_{i=1}^{n_h} Y_{\cdot ih},$$

so that

$$\bar Y_{\cdot h} = M^{-1}\sum_{m=1}^{M}\bar Y_{mh}. \qquad (23)$$

Similarly, for the weight function $w$ with integer coordinates $w_{ih} \ge 0$, $1 \le i \le n_h$, $1 \le h \le H$, let

$$\bar Y_{mh}[w] = \{n_h[w]\}^{-1}\sum_{i=1}^{n_h} w_{ih}Y_{mih} \qquad (24)$$

and

$$\bar Y_{\cdot h}[w] = \{n_h[w]\}^{-1}\sum_{i=1}^{n_h} w_{ih}Y_{\cdot ih} \qquad (25)$$

whenever $n_h[w] > 0$. Let

$$f_m = \sum_{h=1}^{H}\bar Y_{mh}, \qquad (26)$$

and let

$$f_m[w] = \sum_{h=1}^{H}\bar Y_{mh}[w]. \qquad (27)$$
Note that $f_m$ has variance

$$\sigma^2(f_m) = \sum_{h=1}^{H}\Gamma_{mmh}/n_h.$$

The $Y_{mih}$ are selected so that $g_m$ is well approximated by $\gamma_m + f_m$ and $g_m[w]$ is well approximated by $\gamma_m + f_m[w]$. The approximation errors

$$R_{mGS} = g_m - \gamma_m - f_m \qquad (28)$$

and

$$R_{mGS}[w_{GS}(j)] = g_m[w_{GS}(j)] - \gamma_m - f_m[w_{GS}(j)] \qquad (29)$$

must be small for large sample sizes $n_h$, $1 \le h \le H$. To be more specific, it is assumed that

$$E(R_{mGS}^2)/\sigma^2(f_m) \to 0 \qquad (30)$$

and

$$\max_{1 \le j \le k} E\left(\{R_{mGS} - R_{mGS}[w_{GS}(j)]\}^2\right)/\sigma^2(f_m) \to 0 \qquad (31)$$

as the sample sizes increase for all populations $h$.

Because the matrices $\boldsymbol\Gamma_h$ are positive definite for $1 \le h \le H$, (30) and (31) imply that (11) and (12) hold whenever $c_m$, $1 \le m \le M$, are real numbers, some $c_m$ is not 0, and

$$g = \sum_{m=1}^{M} c_m g_m, \qquad \gamma = \sum_{m=1}^{M} c_m\gamma_m, \qquad Y_{ih} = \sum_{m=1}^{M} c_m Y_{mih}.$$

It follows that $\sigma^2(g)/\sigma^2(f)$ converges to 1, the bias $E(g) - \gamma$ is sufficiently small that $[E(g) - \gamma]/\sigma(g)$ approaches 0 as the sample sizes $n_h$ all become large, and the ratio $(k-1)\hat\sigma^2_{GS}(g)/\sigma^2(g)$ has an approximate chi-square distribution on $k - 1$ degrees of freedom. Consideration of the differences $g_m - g_{\cdot}$ shows that the bias

$$\Delta = E(\hat\sigma_\gamma^2) - \sigma_\gamma^2$$

is well approximated by

$$\Delta_0 = (M-1)^{-1}\sum_{m=1}^{M}\sigma^2(g_m - g_{\cdot})$$
in the sense that $\Delta/\Delta_0$ converges to 1 as the sample sizes all become large. If $f_{\cdot}$ is the average $M^{-1}\sum_{m=1}^{M} f_m$ and if

$$\Delta_1 = (M-1)^{-1}\sum_{m=1}^{M}\sigma^2(f_m - f_{\cdot}), \qquad (34)$$

then $\Delta/\Delta_1$ also converges to 1 as the sample sizes all become large.

One may approximate the bias $\Delta$ with the grouped jackknife by use of

$$\hat\Delta_{GS} = (M-1)^{-1}\sum_{m=1}^{M}\hat\sigma^2_{GS}(g_m - g_{\cdot}), \qquad (35)$$

so that a bias-corrected estimate of $\sigma_\gamma^2$ is

$$\hat\sigma^2_{GS\gamma} = \hat\sigma_\gamma^2 - \hat\Delta_{GS}. \qquad (36)$$
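Given the full-sample estimates $g_m$ and the $k \times M$ array of replicates $g_m[w_{GS}(j)]$, the quantities in (16), (35), and (36) take only a few lines. This is a sketch under stated assumptions: the function name and array layout are hypothetical, and the jackknife variance of each difference $g_m - g_{\cdot}$ is computed from the replicate differences $g_m[w_{GS}(j)] - g_{\cdot}[w_{GS}(j)]$ with the factor $(k-1)/k$.

```python
import numpy as np

def anchor_variance_components(g_full, reps):
    """Sketch of (16), (35), and (36).

    g_full : length-M array of full-sample estimates g_m
    reps   : k x M array with reps[j, m] = g_m[w_GS(j)]
    Returns (sigma_hat_gamma^2, Delta_hat_GS, bias-corrected estimate).
    """
    g_full = np.asarray(g_full, dtype=float)
    reps = np.asarray(reps, dtype=float)
    k, M = reps.shape
    s2_gamma = np.var(g_full, ddof=1)                         # (16)
    d = reps - reps.mean(axis=1, keepdims=True)               # g_m[w] - g_.[w]
    var_d = (k - 1) / k * ((d - d.mean(axis=0)) ** 2).sum(axis=0)
    delta_gs = var_d.sum() / (M - 1)                          # (35)
    return s2_gamma, delta_gs, s2_gamma - delta_gs            # (36)
```

If the replicate array is additive in $j$ and $m$, the differences do not vary over groups and $\hat\Delta_{GS}$ is 0.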

To understand this correction, consider a large-sample approximation in which the fraction of observations from each population has a positive limit. For this purpose, let $n_+ = \sum_{h=1}^{H} n_h$ be the total sample size. Let $n_+$ become large and, for each population $h$, let the ratio $n_h/n_+$ approach a positive constant $\omega_h$. Then the large-sample distribution of $\hat\Delta_{GS}/\Delta$ may be studied by use of general results concerning the distribution of quadratic functions of multivariate normal random variables (Box, 1954). Let $\mathbf{Q}$ be the $M$ by $M$ matrix with row $m$ and column $m'$ equal to $1 - M^{-1}$ if $m = m'$ and equal to $-M^{-1}$ if $m \ne m'$. Let

$$\boldsymbol\Omega = \mathbf{Q}\left(\sum_{h=1}^{H}\omega_h^{-1}\boldsymbol\Gamma_h\right)\mathbf{Q}. \qquad (37)$$

Let tr denote the trace of a square matrix. Then $\hat\Delta_{GS}/\Delta$ converges in distribution to a positive random variable $Z$ with expectation 1 and with variance

$$\sigma^2(Z) = \frac{2}{k-1}\,\frac{\mathrm{tr}(\boldsymbol\Omega\boldsymbol\Omega)}{[\mathrm{tr}(\boldsymbol\Omega)]^2}.$$

The trace $\mathrm{tr}(\boldsymbol\Omega\boldsymbol\Omega)$ is the sum of the squares of the $M - 1$ nonzero eigenvalues of $\boldsymbol\Omega$, while $\mathrm{tr}(\boldsymbol\Omega)$ is the sum of the $M - 1$ nonzero eigenvalues of $\boldsymbol\Omega$ (Box, 1954). The Cauchy-Schwarz inequality may be used to demonstrate that $Z$ has variance no greater than $2/(k-1)$ but at least as large as $2/[(k-1)(M-1)]$. Note that $\Delta$ is well approximated by $\Delta_1$, and $\Delta_1$ is of the same order of magnitude as the largest of the inverse sample sizes $n_h^{-1}$, $1 \le h \le H$. Thus the bias $\Delta$ is small in large samples, and the bias correction $\hat\Delta_{GS}$ is also small in such cases.
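The variance of the limiting ratio $Z$ depends only on $\boldsymbol\Omega$ and $k$. The sketch below computes it; weighting $\boldsymbol\Gamma_h$ by $\omega_h^{-1}$ reflects that $\bar Y_{mh}$ has variance $\Gamma_{mmh}/n_h \approx \Gamma_{mmh}/(\omega_h n_+)$, and the function name is hypothetical.

```python
import numpy as np

def limiting_ratio_variance(gammas, omegas, k):
    """Variance of the limit Z of Delta_hat_GS / Delta:
    2 tr(Omega Omega) / ((k - 1) tr(Omega)^2), with
    Omega = Q (sum_h Gamma_h / omega_h) Q and Q the centering matrix."""
    M = np.asarray(gammas[0]).shape[0]
    Q = np.eye(M) - np.full((M, M), 1.0 / M)        # centering matrix
    S = sum(np.asarray(G, float) / w for G, w in zip(gammas, omegas))
    Omega = Q @ S @ Q
    return 2.0 * np.trace(Omega @ Omega) / ((k - 1) * np.trace(Omega) ** 2)
```

With a single population and $\boldsymbol\Gamma_1 = \mathbf{I}$, $\boldsymbol\Omega = \mathbf{Q}$ has $M - 1$ unit eigenvalues, so the variance attains the lower bound $2/[(k-1)(M-1)]$; for $k = 120$ and $M = 5$ that is $2/476$.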
An exact result is available in the special case of $g_m = \gamma_m + f_m$ for $1 \le m \le M$, $\gamma_m$ constant over $m$, $n_h/k$ an integer for $1 \le h \le H$, and $Y_{mih}$ independent normal random variables with common variance for all $m$ and $i$. Application of standard results from two-way analysis of variance with one observation per cell shows that $\sigma_\gamma^2 = 0$, the bias $\Delta$ is $\sigma^2(f_1)$, and $(k-1)(M-1)\hat\Delta_{GS}/\Delta$ has a chi-squared distribution on $(k-1)(M-1)$ degrees of freedom, so that $\hat\Delta_{GS}/\Delta$ has mean 1 and variance $2/[(k-1)(M-1)]$. The variance of $\hat\Delta_{GS}/\Delta$ decreases as the number $k$ of groups increases and as the number $M$ of estimates $g_m$, $1 \le m \le M$, increases. In addition, $\sigma^2(f_1)$ is approximately proportional to the inverse $n_+^{-1}$ of the total sample size, and the ratio $\hat\sigma_\gamma^2/\hat\Delta_{GS}$ has an $F$ distribution with $M - 1$ and $(k-1)(M-1)$ degrees of freedom. Note that in this case, there is a substantial probability that the bias-corrected estimate $\hat\sigma^2_{GS\gamma}$ is negative, even though the estimated quantity $\sigma_\gamma^2$ must be nonnegative.
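A short simulation illustrates the exact case under hypothetical settings: one population, $n = 300$, $k = 10$, $M = 5$, and unit-variance normal $Y_{mi}$. The mean of $\hat\Delta_{GS}/\Delta$ should be close to 1, and the mean of $\hat\sigma_\gamma^2/\hat\Delta_{GS}$ should be close to the $F$ mean $d/(d-2)$ with $d = (k-1)(M-1) = 36$, about 1.059.

```python
import numpy as np

def simulate_exact_case(n=300, k=10, M=5, reps=2000, seed=2):
    """One population, g_m = Ybar_m with Y_mi iid N(0, 1), so gamma_m = 0 and
    Delta = sigma^2(f_1) = 1 / n.  Returns Monte Carlo means of
    Delta_hat_GS / Delta and of sigma_hat_gamma^2 / Delta_hat_GS."""
    rng = np.random.default_rng(seed)
    size = n // k                                   # n / k must be an integer
    y = rng.normal(size=(reps, M, k, size))
    group_sums = y.sum(axis=3)                      # (reps, M, k)
    total = group_sums.sum(axis=2, keepdims=True)   # (reps, M, 1)
    g = total[..., 0] / n                           # g_m, shape (reps, M)
    g_rep = (total - group_sums) / (n - size)       # g_m[w_GS(j)], (reps, M, k)
    d = g_rep - g_rep.mean(axis=1, keepdims=True)   # g_m[w] - g_.[w]
    dev = d - d.mean(axis=2, keepdims=True)
    var_d = (k - 1) / k * (dev ** 2).sum(axis=2)    # sigma_hat^2_GS(g_m - g_.)
    delta_hat = var_d.sum(axis=1) / (M - 1)         # (35)
    s2 = g.var(axis=1, ddof=1)                      # (16)
    delta = 1.0 / n
    return (delta_hat / delta).mean(), (s2 / delta_hat).mean()
```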


1.6 Randomly Selected Estimates 

The analysis in section 1.5 raises a rather basic issue in the context of equating. In many cases, it is assumed implicitly in equating that different selections of anchor sets should lead to the same basic equating results. For example, except for sampling error, conversions of scores on a new form to an old form should be the same. In practice, errors are encountered both due to the failure of equating assumptions and due to sampling error. One simple assessment considers a randomly selected estimate $g_S$, where $S$ is uniformly distributed on the integers 1 to $M$ and independent of the $g_m$. Thus $g_S$ is $g_m$ with probability $M^{-1}$. The estimate $g_S$ reflects results of equating if the anchor set really is randomly selected. The expected value of $g_S$ is the average

$$E(g_S) = E(g_{\cdot}) = M^{-1}\sum_{m=1}^{M}E(g_m) \qquad (38)$$

of the expectations $E(g_m)$, $1 \le m \le M$. The variance $\sigma^2(g_S)$ of $g_S$ has two components, the expected conditional variance of $g_S$ given $S$ and the variance of the conditional mean of $g_S$ given $S$ (Rao, 1973, p. 97). If the biases $E(g_m) - \gamma_m$ are ignored, it follows that

$$\sigma^2(g_S) = \frac{M-1}{M}\sigma_\gamma^2 + M^{-1}\sum_{m=1}^{M}\sigma^2(g_m). \qquad (39)$$

The bias $E(g_S) - \gamma_{\cdot}$ is sufficiently small that

$$\frac{E(g_S) - \gamma_{\cdot}}{\left[M^{-1}\sum_{m=1}^{M}\sigma^2(g_m)\right]^{1/2}} \to 0$$

as the sample sizes $n_h$ all become large.
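Decomposition (39) can be checked by simulation when the $g_m$ are unbiased, so that the conditional mean of $g_S$ given $S = m$ is $\gamma_m$. In this sketch the values $\gamma_m$ and standard deviations $\sigma(g_m)$ are hypothetical, and each $g_m$ is drawn as normal.

```python
import numpy as np

def variance_of_random_choice(gammas, sds, reps=200_000, seed=3):
    """Simulate g_S with S uniform on the M indices and g_m ~ N(gamma_m, sd_m^2),
    independent of S.  Returns (simulated variance of g_S, value from (39))."""
    rng = np.random.default_rng(seed)
    gammas = np.asarray(gammas, float)
    sds = np.asarray(sds, float)
    M = len(gammas)
    s = rng.integers(0, M, size=reps)               # randomly selected anchor set
    g_s = gammas[s] + sds[s] * rng.standard_normal(reps)
    formula = (M - 1) / M * np.var(gammas, ddof=1) + np.mean(sds ** 2)
    return g_s.var(), formula
```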

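In addition, the random difference $g_S - g_{\cdot}$ and the average estimate $g_{\cdot}$ are uncorrelated. To verify this claim, note that (38) implies that $g_S - g_{\cdot}$ has expectation 0. Thus the covariance of $g_S - g_{\cdot}$ and $g_{\cdot}$ is the expectation

$$E([g_S - g_{\cdot}]g_{\cdot}) = M^{-1}\sum_{m=1}^{M}E([g_m - g_{\cdot}]g_{\cdot}) = E([g_{\cdot} - g_{\cdot}]g_{\cdot}) = 0. \qquad (40)$$

As in (39),

$$\sigma^2(g_S - g_{\cdot}) = \frac{M-1}{M}\sigma_\gamma^2 + M^{-1}\sum_{m=1}^{M}\sigma^2(g_m - g_{\cdot}). \qquad (41)$$

Because $g_S$ is the sum of the uncorrelated terms $g_S - g_{\cdot}$ and $g_{\cdot}$, combination of (39), (40), and (41) leads to

$$\sigma^2(g_S) = \frac{M-1}{M}\sigma_\gamma^2 + \sigma^2(g_{\cdot}) + M^{-1}\sum_{m=1}^{M}\sigma^2(g_m - g_{\cdot}). \qquad (42)$$

If $\sigma_\gamma^2$ in (42) is replaced by the bias-corrected estimate (36) and $M^{-1}\sum_m\sigma^2(g_m - g_{\cdot})$ by $M^{-1}(M-1)\hat\Delta_{GS}$ from (35), the terms in $\hat\Delta_{GS}$ cancel, so that with jackknifing $\sigma^2(g_S)$ may be estimated by

$$\hat\sigma^2_{GS}(g_S) = \frac{M-1}{M}\hat\sigma_\gamma^2 + \hat\sigma^2_{GS}(g_{\cdot}). \qquad (43)$$

In (43), the first component on the right-hand side assesses variability in the estimates $g_m$, $1 \le m \le M$, and the second component measures the variability of the average estimate $g_{\cdot}$. As the sample sizes increase for all populations $h$, $\sigma^2(g_S)$ approaches $[(M-1)/M]\sigma_\gamma^2$. If the $\gamma_m$ are not all the same, then $\sigma_\gamma^2 > 0$ and this limiting variance is positive. No matter how large the samples are, accuracy is then limited by the inconsistency of the parameters $\gamma_m$, $1 \le m \le M$, for different anchor choices. Interpretation is to some degree made more complicated because anchor items are not really chosen at random. Nonetheless, the analysis can provide some measure of the impact of anchor choice.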
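Estimate (43) needs only the full-sample estimates $g_m$ and the replicate array $g_m[w_{GS}(j)]$: the replicate of $g_{\cdot}$ for group $j$ is the average of the $g_m[w_{GS}(j)]$ over $m$. A sketch under those assumptions (the function name is hypothetical):

```python
import numpy as np

def random_anchor_variance(g_full, reps):
    """(43): [(M - 1) / M] * sigma_hat_gamma^2 + jackknife variance of g_. .

    g_full : length-M array of full-sample estimates g_m
    reps   : k x M array with reps[j, m] = g_m[w_GS(j)]
    """
    g_full = np.asarray(g_full, float)
    reps = np.asarray(reps, float)
    k, M = reps.shape
    s2_gamma = np.var(g_full, ddof=1)
    g_dot_reps = reps.mean(axis=1)                  # g_.[w_GS(j)]
    var_gdot = (k - 1) / k * np.sum((g_dot_reps - g_dot_reps.mean()) ** 2)
    return (M - 1) / M * s2_gamma + var_gdot
```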

1.7 Overlapping Anchor Sets 

In many common cases, including the examples to be presented in section 3, one anchor set was actually employed in equating. For instance, in one case, an anchor set consisted of 28 items. Alternate anchor sets are obtained by deletion of single items or groups of items. Thus the possible anchor sets used are very similar. For example, one might consider 28 anchor sets derived from the original 28 items by deletion of one item. One would expect that this similarity of anchor sets would result in less variability related to choice of anchor sets than would be encountered were completely different anchor items employed. It is possible to try to estimate the effects of more thorough changes in anchor sets from the limited selections available, but some reasonable assumptions must be made. These assumptions can be similar to those used in jackknifing. They are relevant when anchor items or anchor blocks are selected at random from a finite population large enough that corrections for finite populations can be ignored. The extent to which this model is realistic can be debated, for anchor items are not chosen at random in typical cases. Nonetheless, the analysis may still provide insight into reasonable expectations for variability. Let there be $M$ anchor items (or anchor blocks) $I_m$, $1 \le m \le M$, selected at random and used in an assessment.

For a specific choice of anchor items $I_m$, $1 \le m \le M$, let $g_{Im}$, $1 \le m \le M$, represent an equating result based on use of the anchor items $I_{m'}$ for $m' \ne m$, let $g_{I0}$ be an equating result based on use of all the anchor items $I_m$, $1 \le m \le M$, and let $g_{Im}$ estimate $\gamma_{Im}$. Let

$$\gamma_{I\cdot} = M^{-1}\sum_{m=1}^{M}\gamma_{Im} \qquad (44)$$

and let

$$\sigma_{I\gamma}^2 = (M-1)^{-1}\sum_{m=1}^{M}(\gamma_{Im} - \gamma_{I\cdot})^2. \qquad (45)$$

Define the estimates

$$g_{I\cdot} = M^{-1}\sum_{m=1}^{M} g_{Im} \qquad (46)$$

and

$$\hat\sigma_{I\gamma}^2 = (M-1)^{-1}\sum_{m=1}^{M}(g_{Im} - g_{I\cdot})^2. \qquad (47)$$

Assume that each $g_{Im}$ has a finite mean and a finite variance.

To treat randomly selected estimates, for $0 \le m \le M$, let $g_m$ be the random estimate with value $g_{Im}$ if anchor items $I_{m'}$, $1 \le m' \le M$, are selected. Similarly, let $\gamma_m$ be the random variable with value $\gamma_{Im}$ if the anchor items $I_{m'}$, $1 \le m' \le M$, are selected. Let $\gamma_{\cdot}$ denote the random variable with value $\gamma_{I\cdot}$ if $I_m$, $1 \le m \le M$, are selected, and let $\sigma_\gamma^2$ denote the random variable with value $\sigma_{I\gamma}^2$ if $I_m$, $1 \le m \le M$, are selected. The estimated equating result in practice is $g_0$. The estimate $g_0$ in effect estimates the expectation $E(\gamma_0)$ of $\gamma_0$. The variance $\sigma^2(g_0)$ is the sum of two components. The first component is the expected value $\sigma_1^2(g_0)$ of the random variable with value $\sigma^2(g_{I0})$ if $I_{m'}$, $1 \le m' \le M$, are selected. The second component is the variance $\sigma_2^2(g_0)$ of the random variable with value $E(g_{I0})$ if $I_{m'}$, $1 \le m' \le M$, are selected (Rao, 1973, p. 97). In this section, conditions are developed under which both components can be approximated. The first set of conditions permits use of jackknifing to approximate $\sigma^2(g_{Im})$ for any possible selection of anchor items $I_{m'}$, $1 \le m' \le M$. This set of conditions is essentially the same as in section 1.5. Additional conditions are then imposed to permit approximation of $\sigma_2^2(g_0)$. These conditions are somewhat related to those developed in section 1 for jackknifing for simple random sampling, but they apply to items rather than to examinees.

It is assumed that, for any selection of anchor items $I_{m'}$, $1 \le m' \le M$, the $g_{Im}$ satisfy the basic conditions for grouped jackknifing in stratified random samples that were described in section 1.5. Thus one has independent pairs $(Y_{I0ih}, \mathbf{Y}_{Iih})$, $1 \le i \le n_h$, $1 \le h \le H$, where $\mathbf{Y}_{Iih}$ has coordinates $Y_{Imih}$, $1 \le m \le M$. For population $h$, the pairs $(Y_{I0ih}, \mathbf{Y}_{Iih})$ are identically distributed for $1 \le i \le n_h$, $Y_{I0ih}$ has mean 0 and finite and positive variance $\Gamma_{I00h}$, and $\mathbf{Y}_{Iih}$ has mean $\mathbf{0}$ and finite positive-definite covariance matrix $\boldsymbol\Gamma_{Ih}$. For $0 \le m \le M$, define

$$\bar Y_{Imh} = n_h^{-1}\sum_{i=1}^{n_h} Y_{Imih}, \qquad (48)$$

let

$$\bar Y_{I\cdot h} = M^{-1}\sum_{m=1}^{M}\bar Y_{Imh}, \qquad (49)$$

and let

$$\bar Y_{Imh}[w] = \{n_h[w]\}^{-1}\sum_{i=1}^{n_h} w_{ih}Y_{Imih} \qquad (50)$$

whenever $n_h[w] > 0$. Let

$$f_{Im} = \sum_{h=1}^{H}\bar Y_{Imh}, \qquad (51)$$

and let

$$f_{Im}[w] = \sum_{h=1}^{H}\bar Y_{Imh}[w]. \qquad (52)$$

Let

$$f_{I\cdot} = \sum_{h=1}^{H}\bar Y_{I\cdot h}. \qquad (53)$$

Let the approximation errors

$$R_{ImGS} = g_{Im} - \gamma_{Im} - f_{Im} \qquad (54)$$

and

$$R_{ImGS}[w_{GS}(j)] = g_{Im}[w_{GS}(j)] - \gamma_{Im} - f_{Im}[w_{GS}(j)] \qquad (55)$$

satisfy the conditions that

$$E(R_{ImGS}^2)/\sigma^2(f_{Im}) \to 0 \qquad (56)$$

and

$$\max_{1 \le j \le k} E\left(\{R_{ImGS} - R_{ImGS}[w_{GS}(j)]\}^2\right)/\sigma^2(f_{Im}) \to 0 \qquad (57)$$

as the sample sizes $n_h$ increase for all populations $h$.

Under these conditions, as the sample sizes $n_h$ approach $\infty$, the following limiting relationships hold:

$$\sigma^2(g_{I0})/\sigma^2(f_{I0}) \to 1, \qquad (58)$$

$$[E(g_{I0}) - \gamma_{I0}]/\sigma(g_{I0}) \to 0, \qquad (59)$$

and the ratio $(k-1)\hat\sigma^2_{GS}(g_{I0})/\sigma^2(g_{I0})$ converges in distribution to a random variable with a chi-square distribution on $k - 1$ degrees of freedom. If $\hat\sigma^2_{GS}(g_0)$ denotes the random variable with value $\hat\sigma^2_{GS}(g_{I0})$ if the anchor items $I_m$, $1 \le m \le M$, are selected, then one can certainly approximate $\sigma_1^2(g_0)$ by use of $\hat\sigma^2_{GS}(g_0)$.

To estimate $\sigma_2^2(g_0)$ requires some assumptions concerning the parameters $\gamma_{Im}$ and the random variables $Y_{Imih}$ for $0 \le m \le M$. The assumption made here is that the parameter $\gamma_{Im}$ has a decomposition

$$\gamma_{Im} = \begin{cases}\beta + (M-1)^{-1}\sum_{m' \ne m}\nu(I_{m'}) + \zeta_{Im}, & 1 \le m \le M,\\[4pt] \beta + M^{-1}\sum_{m'=1}^{M}\nu(I_{m'}) + \zeta_{I0}, & m = 0,\end{cases} \qquad (60)$$

where the constants $\zeta_{Im}$ are remainder terms, and the random variable $Y_{Imih}$ has the decomposition

$$Y_{Imih} = \begin{cases}Z_{ih} + (M-1)^{-1}\sum_{m' \ne m}U_{ih}(I_{m'}) + e_{Imih}, & 1 \le m \le M,\\[4pt] Z_{ih} + M^{-1}\sum_{m'=1}^{M}U_{ih}(I_{m'}) + e_{I0ih}, & m = 0,\end{cases} \qquad (61)$$

where the random variables $e_{Imih}$ are remainder terms. In (60), 0 is the average of the $\nu(A)$ over all possible anchor items $A$, and $\sigma_\nu^2$ is the variance of a random variable $\nu_m$ with value $\nu(I_m)$ if $I_m$ is randomly selected. In (61), the components $Z_{ih}$ and $U_{ih}(I_{m'})$ are all independently distributed. For each sample $h$, the $Z_{ih}$ are identically distributed with mean 0 and finite variance $\sigma_{Zh}^2$, and the $U_{ih}(I_{m'})$ are identically distributed for each anchor item $I_{m'}$ and have mean 0 and finite variance $\sigma_{Uh}^2(I_{m'}) > 0$. The parameter $\sigma_{Uh}^2$ denotes the mean of the random variable with value $\sigma_{Uh}^2(I_{m'})$ if the $I_{m'}$ are selected at random for $1 \le m' \le M$.
From (60),

$$\gamma_{I\cdot} = \beta + M^{-1}\sum_{m=1}^{M}\nu(I_m) + \zeta_{I\cdot},$$

where

$$\zeta_{I\cdot} = M^{-1}\sum_{m=1}^{M}\zeta_{Im}.$$

From (61),

$$Y_{I\cdot ih} = M^{-1}\sum_{m=1}^{M}Y_{Imih} = Z_{ih} + M^{-1}\sum_{m=1}^{M}U_{ih}(I_m) + e_{I\cdot ih},$$

where

$$e_{I\cdot ih} = M^{-1}\sum_{m=1}^{M}e_{Imih}.$$

Thus

$$\delta_I = \gamma_{I0} - \gamma_{I\cdot} = \zeta_{I0} - \zeta_{I\cdot}$$

and

$$V_{Iih} = Y_{I0ih} - Y_{I\cdot ih} = e_{I0ih} - e_{I\cdot ih}.$$

Let $\zeta_{\cdot}$ be the random variable with value $\zeta_{I\cdot}$ if $I_{m'}$, $1 \le m' \le M$, are selected, let $W_{eh}$ be the random variable with value $E(e_{I\cdot ih}^2)$ if $I_{m'}$, $1 \le m' \le M$, are selected, let $\delta$ be the random variable with value $\delta_I$ if $I_{m'}$, $1 \le m' \le M$, are selected, and let $W_{Vh}$ be the random variable with value $E(V_{Iih}^2)$ if $I_{m'}$, $1 \le m' \le M$, are selected. The approximate methods used in this section require that $\zeta_{\cdot}$, $\delta$, $E(W_{eh})$, and $E(W_{Vh})$ all be small relative to $\sigma_\nu^2/M$. To examine this claim, consider the simplified case in which $\zeta_{I\cdot}$, $\zeta_{I0}$, $e_{I\cdot ih}$, and $e_{I0ih}$ are all 0 for any anchor items $I_{m'}$, $1 \le m' \le M$. In this case, comparison with delete-1 jackknifing for sample means shows that $\sigma_2^2(g_0)$ is $\sigma_\nu^2/M$ and

$$\sigma_\gamma^2 = (M-1)^{-3}\sum_{m=1}^{M}(\nu_m - \nu_{\cdot})^2,$$

where $\nu_{\cdot}$ is the average $M^{-1}\sum_{m=1}^{M}\nu_m$ of the $\nu_m$, $1 \le m \le M$. The expectation of $\sigma_\gamma^2$ is then $(M-1)^{-2}\sigma_\nu^2$, so that $M^{-1}(M-1)^2\sigma_\gamma^2$ has expectation $\sigma_2^2(g_0)$. In addition, $g_0$ and $g_{\cdot}$ are equal. If $\hat\sigma^2_{GS}(g_{\cdot})$ denotes the random variable with value $\hat\sigma^2_{GS}(g_{I\cdot})$ if the anchor items $I_m$, $1 \le m \le M$, are selected, then $\sigma_1^2(g_0)$ may be approximated by $\hat\sigma^2_{GS}(g_{\cdot})$ as well as by $\hat\sigma^2_{GS}(g_0)$. If

$$\hat\Delta_{IGS} = (M-1)^{-1}\sum_{m=1}^{M}\hat\sigma^2_{GS}(g_{Im} - g_{I\cdot}) \qquad (62)$$

for anchor items $I_{m'}$, $1 \le m' \le M$, if $\hat\Delta_{GS}$ is the random estimate with value $\hat\Delta_{IGS}$ if the $I_{m'}$, $1 \le m' \le M$, are selected, and if $\hat\sigma_\gamma^2$ is the random estimate with value $\hat\sigma_{I\gamma}^2$, then $\sigma_2^2(g_0)$ may be approximated by $M^{-1}(M-1)^2[\hat\sigma_\gamma^2 - \hat\Delta_{GS}]$. Approximations are most satisfactory if the number $k$ of groups and the number $M$ of anchor items are large.
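The algebra of the simplified case is easy to verify numerically. The sketch takes $\beta = 0$, no remainder terms, and hypothetical values $M = 6$ and $\sigma_\nu = 2$; it forms the leave-one-out parameters $\gamma_{Im}$, checks that $M^{-1}(M-1)^2\sigma_{I\gamma}^2$ has mean $\sigma_\nu^2/M$, and compares this with the variance of $\gamma_{I0} = \nu_{\cdot}$ over random anchor selections.

```python
import numpy as np

def leave_one_out_spread(M=6, sigma_nu=2.0, draws=100_000, seed=4):
    """Simplified case of (60) with beta = 0 and no remainder terms.

    Draws nu(I_1), ..., nu(I_M) iid N(0, sigma_nu^2), forms gamma_Im as the
    mean of the other M - 1 values, and returns
    (mean of M^{-1} (M - 1)^2 sigma_I_gamma^2, variance of gamma_I0)."""
    rng = np.random.default_rng(seed)
    nu = rng.normal(0.0, sigma_nu, size=(draws, M))
    gam = (nu.sum(axis=1, keepdims=True) - nu) / (M - 1)   # gamma_Im, 1 <= m <= M
    s2 = gam.var(axis=1, ddof=1)                             # sigma_I_gamma^2, (45)
    scaled = (M - 1) ** 2 / M * s2
    gamma0 = nu.mean(axis=1)                                 # gamma_I0 with beta = 0
    return scaled.mean(), gamma0.var()
```

For a single selection, the leave-one-out values satisfy $\sigma_{I\gamma}^2 = (M-1)^{-3}\sum_m(\nu_m - \nu_{\cdot})^2$ exactly, because $\gamma_{Im} - \gamma_{I\cdot} = -(\nu_m - \nu_{\cdot})/(M-1)$.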

If $\sigma_\nu^2$ is 0 and if, for each population $h$, the $U_{ih}(I_{m'})$ all have the same variance, then $\sigma_\gamma^2 = 0$ and

$$F_{GS} = \hat\sigma_\gamma^2/\hat\Delta_{GS} \qquad (63)$$

has an approximate $F$ distribution on $M - 1$ and $(M-1)(k-1)$ degrees of freedom. The result is exact if each ratio $n_h/k$ is an integer and if the $U_{ih}(I_{m'})$ have normal distributions.

The suggested estimate of $\sigma^2(g_0)$ based on the case of no remainder errors is

$$\hat\sigma^2_{GSI}(g_0) = \hat\sigma^2_{GS}(g_0) + M^{-1}(M-1)^2[\hat\sigma_\gamma^2 - \hat\Delta_{GS}]. \qquad (64)$$

A slight modification $\tilde\sigma^2_{GSI}(g_0)$ of this estimate has the attraction that it can be computed from a two-way array of estimates $g_m[w_{GS}(j)]$, $1 \le j \le k$, $1 \le m \le M$. This modified estimate is the same as $\hat\sigma^2_{GSI}(g_0)$ if $\zeta_{I\cdot}$, $\zeta_{I0}$, $e_{I\cdot ih}$, and $e_{I0ih}$ are all 0. To define the modified estimate, let $g_m[w_{GS}(j)]$ be the random estimate with value $g_{Im}[w_{GS}(j)]$ if items $I_{m'}$, $1 \le m' \le M$, are selected. The average of the $g_m[w_{GS}(j)]$, $1 \le m \le M$, is denoted by $g_{\cdot}[w_{GS}(j)]$, and the average of the $g_{\cdot}[w_{GS}(j)]$, $1 \le j \le k$, is denoted by $\bar g_{\cdot}$. For each item $I_m$, the average of the $g_m[w_{GS}(j)]$, $1 \le j \le k$, is denoted by $\bar g_m$. Let

$$\tilde\sigma^2_{GS}(g_0) = \frac{k-1}{k}\sum_{j=1}^{k}\{g_{\cdot}[w_{GS}(j)] - \bar g_{\cdot}\}^2,$$

$$\tilde\sigma_\gamma^2 = (M-1)^{-1}\sum_{m=1}^{M}(\bar g_m - \bar g_{\cdot})^2,$$

and

$$\tilde\Delta_{GS} = \frac{k-1}{k}(M-1)^{-1}\sum_{j=1}^{k}\sum_{m=1}^{M}\{g_m[w_{GS}(j)] - g_{\cdot}[w_{GS}(j)] - \bar g_m + \bar g_{\cdot}\}^2.$$

The modified estimate is then

$$\tilde\sigma^2_{GSI}(g_0) = \tilde\sigma^2_{GS}(g_0) + M^{-1}(M-1)^2[\tilde\sigma_\gamma^2 - \tilde\Delta_{GS}]. \qquad (65)$$
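The modified estimate (65) uses only the $k \times M$ array of replicates $g_m[w_{GS}(j)]$, and its three ingredients are the row, column, and interaction pieces of a two-way layout. A sketch follows; the function name is hypothetical.

```python
import numpy as np

def modified_anchor_variance(reps):
    """Sketch of (65) from a k x M array reps[j, m] = g_m[w_GS(j)].
    Returns (total, sigma_tilde^2_GS(g_0), sigma_tilde_gamma^2, Delta_tilde_GS)."""
    reps = np.asarray(reps, float)
    k, M = reps.shape
    row = reps.mean(axis=1)        # g_.[w_GS(j)]
    col = reps.mean(axis=0)        # g-bar_m
    grand = reps.mean()            # g-bar_.
    var_g0 = (k - 1) / k * np.sum((row - grand) ** 2)
    s2_gamma = np.sum((col - grand) ** 2) / (M - 1)
    resid = reps - row[:, None] - col[None, :] + grand      # interaction part
    delta = (k - 1) / k * np.sum(resid ** 2) / (M - 1)
    total = var_g0 + (M - 1) ** 2 / M * (s2_gamma - delta)
    return total, var_g0, s2_gamma, delta
```

For an additive array (no interaction), $\tilde\Delta_{GS} = 0$, so the anchor component is simply $M^{-1}(M-1)^2\tilde\sigma_\gamma^2$.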





The approximations used in this section are less precise in practice than approximations used in earlier sections, for the number of anchor items is typically somewhat smaller than the number of groups and is much smaller than the actual sample sizes. Nonetheless, the basic issue remains that, as in section 1.6, the variance of $g_0$ does not approach 0 even for large sample sizes unless the parameter $\gamma_0$ is the same for all possible selections of anchor items $I_m$, $1 \le m \le M$.

2 IRT True-Score Equating 

In the examples under study, IRT true-score equating is used for equating of two administrations, Administration 1 and Administration 2, by use of a collection of common external anchor items $I_m$, $1 \le m \le M$. The approach of Stocking and Lord (Stocking & Lord, 1983) is used with a generalized partial credit model (Muraki, 1997). This approach reflects practices of the particular testing program under study. Numerous alternatives are available (Hambleton, Swaminathan, & Rogers, 1991, ch. 9). The number of examinees in Administration $h$ is denoted by $n_h$. In Administration 1, each examinee receives items $I_m$ for $M + 1 \le m \le M_1$, where $M_1 \ge M + 1$, and these items are used to score the examinee performance. In Administration 2, each examinee receives items $I_m$, $M_1 + 1 \le m \le M_2$, where $M_2 \ge M_1 + 1$, and these items are used to score the examinee. In addition, some examinees from each administration receive the common anchor items $I_m$, $1 \le m \le M$. It suffices to assume that whether an examinee in Administration $h$ receives the external anchor items $I_m$, $1 \le m \le M$, is a random event not related to any characteristics or responses of any of the examinees.

Estimation is performed with an item-response model in which the proficiency distribution of the population of examinees for Administration 1 is standard normal, while examinees who receive Administration 2 are assumed to have a proficiency distribution that is normal with mean $B$ and standard deviation $A > 0$. Conditional on the proficiency $\theta$ of an examinee, it is assumed that item scores for each item presented are conditionally independent. Item scores for an item $I_m$ have possible values from 0 to $r_m - 1$, where $r_m$ is an integer greater than 1. The conditional probability $P_m(k|\theta)$ that an examinee with proficiency $\theta$ receiving either form has response score $k$ on a presented item $I_m$ is assumed to satisfy the logit relationship

$$\log[P_m(k|\theta)/P_m(k-1|\theta)] = Da_m(\theta - b_m + d_{mk}), \qquad 1 \le k \le r_m - 1,$$

where item discrimination $a_m$ is an unknown positive real number, item difficulty $b_m$ is an unknown real number, and the category coefficients $d_{mk}$, $1 \le k \le r_m - 1$, are real numbers unknown save for the constraint that their sum is 0 (Muraki, 1997). Thus $d_{m1} = 0$ if $r_m = 2$, and the model then reduces to the two-parameter logistic model. The constant $D$ is fixed. It may be chosen to be 1, 1.7, or 1.702. The last choice is made here for consistency with the Parscale software used in computations at ETS.

In the case of Administration 2, the scaled examinee proficiency $\theta' = (\theta - B)/A$ has a standard normal distribution. With respect to the scaled proficiency $\theta'$, the conditional probability $P'_m(k|\theta')$ that an examinee with scaled proficiency $\theta'$ in Administration 2 has response score $k$ on a presented item $I_m$ satisfies the logit relationship

$$\log[P'_m(k|\theta')/P'_m(k-1|\theta')] = Da'_m(\theta' - b'_m + d'_{mk}),$$

where $a'_m = Aa_m$, $b'_m = (b_m - B)/A$, and $d'_{mk} = d_{mk}/A$. Marginal maximum likelihood, conditional on the items presented to each examinee, is separately employed for each administration (Bock & Aitkin, 1981). Administration 1 yields maximum-likelihood estimates $\hat a_m$ for $a_m$, $\hat b_m$ for $b_m$, and $\hat d_{mk}$ for $d_{mk}$, $1 \le k \le r_m - 1$, for $1 \le m \le M_1$. Administration 2 yields maximum-likelihood estimates $\hat a'_m$ for $a'_m$, $\hat b'_m$ for $b'_m$, and $\hat d'_{mk}$ for $d'_{mk}$, $1 \le k \le r_m - 1$, for $1 \le m \le M$ and for $M_1 + 1 \le m \le M_2$.
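The category probabilities of the generalized partial credit model follow from the adjacent-category logits by cumulative summation and normalization. The sketch below (function name hypothetical, $D = 1.702$ as in the report) also makes it easy to confirm the rescaling relations: with $a'_m = Aa_m$, $b'_m = (b_m - B)/A$, and $d'_{mk} = d_{mk}/A$, the probabilities at $\theta'$ equal those of the original parameters at $A\theta' + B$.

```python
import numpy as np

D = 1.702  # scaling constant chosen in the report

def gpcm_probs(theta, a, b, d):
    """Category probabilities P_m(k | theta), k = 0, ..., r_m - 1, for one item.

    Uses log[P(k) / P(k - 1)] = D a (theta - b + d_k) for k >= 1, where d holds
    the r_m - 1 category coefficients, then normalizes so probabilities sum to 1.
    theta may be a scalar or 1-D array; the result has shape (len(theta), r_m).
    """
    theta = np.atleast_1d(np.asarray(theta, float))
    steps = D * a * (theta[:, None] - b + np.asarray(d, float)[None, :])
    logits = np.concatenate([np.zeros((theta.size, 1)),
                             np.cumsum(steps, axis=1)], axis=1)
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)
```

With a single coefficient $d_{m1} = 0$ (two categories), `gpcm_probs` returns the two-parameter logistic probability $1/\{1 + \exp[-Da_m(\theta - b_m)]\}$ in its second column.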

To estimate $A$ and $B$, the Stocking-Lord method is used (Stocking & Lord, 1983). Here the estimated test characteristic curves for the common anchor items are computed for the two administrations. For Administration 1,

$$T(\theta) = \sum_{m=1}^{M}\sum_{k=1}^{r_m-1}k\hat P_m(k|\theta),$$

where the estimated conditional probabilities $\hat P_m(k|\theta)$ are determined by the equations

$$\log[\hat P_m(k|\theta)/\hat P_m(k-1|\theta)] = D\hat a_m(\theta - \hat b_m + \hat d_{mk})$$

for $1 \le k \le r_m - 1$ and by the constraint that $\sum_{k=0}^{r_m-1}\hat P_m(k|\theta) = 1$. In like manner, for Administration 2,

$$T'(\theta) = \sum_{m=1}^{M}\sum_{k=1}^{r_m-1}k\hat P'_m(k|\theta),$$

where the estimated conditional probabilities $\hat P'_m(k|\theta)$ are determined by the equations

$$\log[\hat P'_m(k|\theta)/\hat P'_m(k-1|\theta)] = D\hat a'_m(\theta - \hat b'_m + \hat d'_{mk})$$

for $1 \le k \le r_m - 1$ and by the constraint that $\sum_{k=0}^{r_m-1}\hat P'_m(k|\theta) = 1$. Estimates $\hat A$ for $A$ and $\hat B$ for $B$ are then obtained by minimizing over $A$ and $B$ the integral

$$\int[T(A\theta + B) - T'(\theta)]^2\phi(\theta)\,d\theta,$$

where $\phi$ is the density function of the standard normal distribution. This pairing follows from the relations $a'_m = Aa_m$, $b'_m = (b_m - B)/A$, and $d'_{mk} = d_{mk}/A$, under which $P'_m(k|\theta) = P_m(k|A\theta + B)$. The integral must be evaluated by some numerical quadrature method. In the analysis performed in this report, the ETS convention was followed that the integral was approximated by use of 201 equally spaced quadrature points from $-3$ to 3.
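A compact sketch of the Stocking-Lord step follows. It is not the report's implementation: the Gauss-Newton iteration with a forward-difference Jacobian is our choice of optimizer, the function names are hypothetical, and the quadrature uses the 201 equally spaced points from $-3$ to 3 with weights $\phi(\theta_q)$ (the constant spacing factor is dropped because it does not change the minimizer). The criterion pairs the Administration 1 anchor curve at $A\theta + B$ with the Administration 2 anchor curve at $\theta$, which is the pairing implied by $a'_m = Aa_m$ and $b'_m = (b_m - B)/A$.

```python
import numpy as np

D = 1.702
NODES = np.linspace(-3.0, 3.0, 201)                       # 201 equally spaced points
WEIGHTS = np.exp(-NODES ** 2 / 2) / np.sqrt(2 * np.pi)    # phi at the nodes

def tcc(theta, items):
    """Test characteristic curve sum_m sum_k k P_m(k | theta) for GPCM items.
    items: list of (a, b, d) with d the r_m - 1 category coefficients."""
    total = np.zeros_like(theta, dtype=float)
    for a, b, d in items:
        steps = D * a * (theta[:, None] - b + np.asarray(d, float)[None, :])
        logits = np.hstack([np.zeros((theta.size, 1)), np.cumsum(steps, axis=1)])
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        total += p @ np.arange(p.shape[1])                # expected item score
    return total

def stocking_lord(items1, items2, iters=50, h=1e-6):
    """Gauss-Newton minimization over (A, B) of
    sum_q phi(theta_q) [T(A theta_q + B) - T'(theta_q)]^2, where T uses the
    Administration 1 anchor estimates and T' the Administration 2 ones."""
    root_w = np.sqrt(WEIGHTS)
    target = tcc(NODES, items2)

    def resid(x):
        return root_w * (tcc(x[0] * NODES + x[1], items1) - target)

    x = np.array([1.0, 0.0])                              # start at A = 1, B = 0
    for _ in range(iters):
        r = resid(x)
        # forward-difference Jacobian with respect to (A, B)
        J = np.column_stack([(resid(x + h * e) - r) / h for e in np.eye(2)])
        step = np.linalg.lstsq(J, -r, rcond=None)[0]
        x = x + step
        if np.max(np.abs(step)) < 1e-12:
            break
    return x                                              # (A_hat, B_hat)
```

When the Administration 2 parameters are exact transforms of the Administration 1 parameters, the criterion is 0 at the true $(A, B)$ and the iteration recovers them.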

Given $\hat A$ and $\hat B$, parameter estimates on the Administration 1 scale are obtained for Administration 2. Thus $a_m$ is estimated by $\hat a_{m2} = \hat a'_m/\hat A$, $b_m$ is estimated by $\hat b_{m2} = \hat A\hat b'_m + \hat B$, and $d_{mk}$ is estimated by $\hat d_{mk2} = \hat A\hat d'_{mk}$. For the items used in reporting scores, test characteristic curves

$$T_1(\theta) = \sum_{m=M+1}^{M_1}\sum_{k=1}^{r_m-1}k\hat P_m(k|\theta)$$

and

$$T_2(\theta) = \sum_{m=M_1+1}^{M_2}\sum_{k=1}^{r_m-1}k\hat P_{m2}(k|\theta)$$

are obtained. Here

$$\log[\hat P_{m2}(k|\theta)/\hat P_{m2}(k-1|\theta)] = D\hat a_{m2}(\theta - \hat b_{m2} + \hat d_{mk2})$$

for $1 \le k \le r_m - 1$, and $\sum_{k=0}^{r_m-1}\hat P_{m2}(k|\theta) = 1$. In Administration 1, total scores for items numbered from $M + 1$ to $M_1$ can range from 0 to

$$S_1 = \sum_{m=M+1}^{M_1}(r_m - 1).$$

In Administration 2, total scores for items numbered from $M_1 + 1$ to $M_2$ can range from 0 to

$$S_2 = \sum_{m=M_1+1}^{M_2}(r_m - 1).$$

Consider a total score $s_2$ for Administration 2. In true-score equating, $s_2 = 0$ in Administration 2 is linked to $s_1 = 0$ in Administration 1, while $s_2 = S_2$ in Administration 2 is linked to $s_1 = S_1$ in Administration 1. If $0 < s_2 < S_2$, then $s_2$ in Administration 2 is linked to $T_1(\theta_2)$, where $\theta_2$ satisfies $T_2(\theta_2) = s_2$. Because $T_2$ increases strictly from 0 to $S_2$, this $\theta_2$ is unique.
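The linking rule can be sketched with bisection, since $T_2$ increases from 0 to $S_2$. The function name, the search bracket $[-20, 20]$, and the tolerance are assumptions; $T_1$ and $T_2$ can be any increasing test characteristic curves, such as those built from the estimates above.

```python
import numpy as np

def true_score_equate(s2, T1, T2, S1, S2, lo=-20.0, hi=20.0, tol=1e-10):
    """Map an Administration 2 score s2 to the Administration 1 scale.

    T1, T2 : increasing test characteristic curves (callables of theta)
    Endpoints map to endpoints; otherwise solve T2(theta) = s2 by bisection
    on [lo, hi] (assumed to bracket the root) and return T1(theta).
    """
    if s2 <= 0:
        return 0.0
    if s2 >= S2:
        return float(S1)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if T2(mid) < s2:
            lo = mid
        else:
            hi = mid
    return float(T1(0.5 * (lo + hi)))
```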
If the model assumptions employed all hold, it is readily shown that all estimates developed in this section have suitable properties for application of the jackknife. Thus the example is appropriate for the jackknifing methods developed in section 1.7. As previously noted, there is some complication to the extent that items are not, in practice, selected at random in typical educational tests. Thus some care will be needed in discussing the effect of selection of anchor items.

It should be noted that the standard errors determined by jackknifing apply whether the 
model assumptions used in linking are true or not. Parameters still have asymptotic means, but 
their interpretation is more complex (Haberman, 2007). 

3 Example 

To illustrate methodology, data from two sections of an assessment are considered for two administrations. In the first section, to be termed section 1, 42 items are used to score examinees, and 28 anchor items are employed, so that $M = 28$, $M_1 = 70$, and $M_2 = 112$. In the second section, to be termed section 2, 34 items are used to score examinees, and 17 anchor items are used, so that $M = 17$, $M_1 = 51$, and $M_2 = 85$. Total sample sizes are about 6,000 for Administration 1 and 8,000 for Administration 2. About 1,600 examinees in each administration receive the anchor items for section 2. In the case of section 1, about 3,100 examinees receive the anchor items for Administration 2, and about 1,600 receive the anchor items for Administration 1. As previously noted, for jackknifing of examinees, $k = 120$ disjoint subsets are employed.

Table 1 provides estimates and standard errors for parameters A and B for the two sections. In these computations, the common items are assumed to be given. Results for a model in which the common items are randomly drawn differ for the two sections. The effects of item selection are examined by removal of one anchor item at a time. In section 1, the ratio $F_{gs}$ of (63) is 2.44 for estimates of A and 4.01 for estimates of B. Given that M is 28 and k is 120, both F statistics are very highly significant. The variance components associated with anchor selection therefore appear to be positive for both A and B, and the effect of anchor selection is a concern. With anchor sets regarded as random, the estimated asymptotic standard deviation of A increases to 0.0323, and the estimated asymptotic standard deviation of B increases to 0.0330. These estimates are considerably larger than the customary estimated asymptotic standard errors in Table 1 (0.024 and 0.022), so there is cause for concern about the effects of item selection. In section 2, the respective $F_{gs}$ statistics are 0.81 and 1.18, and M is now 17. There is therefore no clear evidence that selection of anchor items has an effect.


Table 1

Estimated A and B Parameters and Estimated Asymptotic Standard Errors

| Parameter | Section | Estimate | Standard error |
|:---:|:---:|---:|---:|
| A | 1 | 0.989 | 0.024 |
| B | 1 | -0.093 | 0.022 |
| A | 2 | 0.956 | 0.029 |
| B | 2 | -0.086 | 0.029 |


Table 2 provides results for conversions of total item scores from Administration 2 to total item scores from Administration 1 for section 1. Table 3 provides the corresponding results for section 2. For section 1, there is an appreciable effect of anchor selection, but the basic result remains that standard errors can be as large as about a third of a raw score point. The largest values are 0.305 with anchor items fixed and 0.358 with anchor items random. In section 2, selection of anchor items does not have an obvious effect, so only the conventional grouped-jackknife standard errors, with anchor items treated as fixed, are provided. The standard errors are roughly comparable to those for
section 1. Impact of the standard errors in practice depends on the choice of raw-to-scale 
conversion used in the testing program, on the standard deviation of scaled test scores, and on 
whether the assessment is applied to individual examinees or to groups of examinees. Reasons 
for the variability of results due to the specific anchor items selected in section 1 require further 
investigation. The general issue raised is that it is possible in practice for the selection of anchor 
items to have an appreciable effect on the variability of equating results.

Table 2

Estimated Conversions of Total Item Score for Section 1

| Score | Estimate | Stand. err. (anchor fixed) | Stand. err. (anchor random) |
|---:|---:|---:|---:|
| 0 | 0.000 | 0.000 | 0.000 |
| 1 | 1.255 | 0.113 | 0.135 |
| 2 | 2.338 | 0.165 | 0.200 |
| 3 | 3.343 | 0.202 | 0.246 |
| 4 | 4.300 | 0.230 | 0.280 |
| 5 | 5.226 | 0.251 | 0.304 |
| 6 | 6.130 | 0.269 | 0.323 |
| 7 | 7.015 | 0.283 | 0.338 |
| 8 | 7.888 | 0.292 | 0.348 |
| 9 | 8.752 | 0.301 | 0.355 |
| 10 | 9.608 | 0.304 | 0.357 |
| 11 | 10.458 | 0.305 | 0.358 |
| 12 | 11.306 | 0.305 | 0.356 |
| 13 | 12.152 | 0.302 | 0.352 |
| 14 | 12.999 | 0.301 | 0.347 |
| 15 | 13.847 | 0.293 | 0.340 |
| 16 | 14.699 | 0.288 | 0.333 |
| 17 | 15.555 | 0.281 | 0.325 |
| 18 | 16.418 | 0.271 | 0.315 |
| 19 | 17.289 | 0.262 | 0.307 |
| 20 | 18.168 | 0.252 | 0.298 |
| 21 | 19.057 | 0.243 | 0.292 |
| 22 | 19.957 | 0.232 | 0.285 |
| 23 | 20.868 | 0.221 | 0.279 |
| 24 | 21.792 | 0.213 | 0.277 |
| 25 | 22.729 | 0.205 | 0.276 |
| 26 | 23.680 | 0.197 | 0.277 |
| 27 | 24.645 | 0.192 | 0.280 |
| 28 | 25.625 | 0.188 | 0.285 |
| 29 | 26.621 | 0.186 | 0.292 |
| 30 | 27.633 | 0.185 | 0.300 |
| 31 | 28.663 | 0.186 | 0.308 |
| 32 | 29.712 | 0.190 | 0.317 |
| 33 | 30.781 | 0.193 | 0.326 |
| 34 | 31.871 | 0.198 | 0.333 |
| 35 | 32.985 | 0.202 | 0.339 |
| 36 | 34.123 | 0.205 | 0.343 |
| 37 | 35.287 | 0.206 | 0.342 |
| 38 | 36.476 | 0.206 | 0.338 |
| 39 | 37.690 | 0.202 | 0.326 |
| 40 | 38.924 | 0.194 | 0.307 |
| 41 | 40.168 | 0.180 | 0.279 |
| 42 | 41.406 | 0.157 | 0.238 |
| 43 | 42.612 | 0.126 | 0.185 |
| 44 | 43.760 | 0.085 | 0.117 |
| 45 | 45.000 | 0.000 | 0.000 |

Table 3

Estimated Conversions of Total Item Score for Section 2

| Score | Estimate | Stand. err. (anchor fixed) |
|---:|---:|---:|
| 0 | 0.000 | 0.000 |
| 1 | 0.821 | 0.118 |
| 2 | 1.617 | 0.178 |
| 3 | 2.418 | 0.222 |
| 4 | 3.240 | 0.259 |
| 5 | 4.082 | 0.288 |
| 6 | 4.940 | 0.310 |
| 7 | 5.807 | 0.324 |
| 8 | 6.678 | 0.331 |
| 9 | 7.551 | 0.333 |
| 10 | 8.421 | 0.330 |
| 11 | 9.288 | 0.323 |
| 12 | 10.152 | 0.313 |
| 13 | 11.014 | 0.300 |
| 14 | 11.876 | 0.287 |
| 15 | 12.739 | 0.271 |
| 16 | 13.606 | 0.257 |
| 17 | 14.479 | 0.242 |
| 18 | 15.360 | 0.228 |
| 19 | 16.250 | 0.215 |
| 20 | 17.153 | 0.204 |
| 21 | 18.068 | 0.195 |
| 22 | 18.997 | 0.189 |
| 23 | 19.942 | 0.184 |
| 24 | 20.904 | 0.183 |
| 25 | 21.886 | 0.183 |
| 26 | 22.894 | 0.187 |
| 27 | 23.935 | 0.193 |
| 28 | 25.016 | 0.199 |
| 29 | 26.148 | 0.205 |
| 30 | 27.341 | 0.209 |
| 31 | 28.602 | 0.208 |
| 32 | 29.939 | 0.199 |
| 33 | 31.348 | 0.175 |
| 34 | 32.803 | 0.122 |
| 35 | 34.000 | 0.000 |

4 Conclusions

The analysis in this report indicates that jackknifing may be employed both to examine sampling variability in equating and to analyze the sensitivity of equating results to anchor selection. The approach is very widely applicable to equating studies. It would also apply if alternative linking procedures, such as concurrent calibration or mean and sigma methods (Hambleton et al., 1991), were used. The approach may also be used when the general partial credit model is replaced by the partial credit model with $a_m$ constant for all $m$ (Muraki, 1997). It is also quite possible to apply the approach to observed-score equating methods such as kernel equating (von Davier, Holland, & Thayer, 2004). The example, as is common in textbook discussions, considers a simple linking of one form to another. However, the methodology is also suitable for examination of a much more complex sequence of test forms that are linked through many different sets of anchor items.

Consideration of both the sampling variability and the variability of equating results with respect to anchor selection is important in any assessment of the effectiveness of equating for a testing program. It is clearly important for the variability of equating results to be small relative to the measurement error for individual examinees. For example, consider the following case. Suppose the conversion is based on an arbitrarily large sample of examinees and an arbitrarily large selection of different anchor sets. Under that conversion, an examinee's equated score on a form has a standard deviation of 5, and the reliability of this score is 0.84. Thus the standard error of measurement is $5(1 - 0.84)^{1/2} = 2$. Suppose that use of finite samples to compute equating functions and use of one of many possible anchor sets results in a random scoring error for the examinee with mean 0 and standard deviation 1. Suppose also that the random scoring error is uncorrelated with the examinee's error of measurement based on the ideal conversion. Then the effective standard error of measurement is $(2^2 + 1^2)^{1/2} = 2.236$ rather than 2. The effective standard deviation is $(5^2 + 1^2)^{1/2} = 5.099$, and the effective reliability is reduced to $1 - (2^2 + 1^2)/(5^2 + 1^2) = 0.808$. The impact of equating error can be far more important when a group of examinees is studied. Consider a sample of 100 randomly selected examinees for the form under study. For these examinees, the standard deviation of the mean equated score for the ideal conversion is $5/100^{1/2} = 0.5$, and the standard deviation of the mean error of measurement is $2/100^{1/2} = 0.2$, so the reliability of the estimated mean score remains $1 - 0.2^2/0.5^2 = 0.84$. On the other hand, the same conversion is applied to every examinee in the sample. It is therefore quite possible that the mean random scoring error has essentially the same distribution as the random scoring error for a single examinee, so that 1 remains the standard deviation of the mean random scoring error. Thus the effective reliability of the mean equated score is now only $1 - (0.2^2 + 1^2)/(0.5^2 + 1^2) = 0.168$.
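The following Python sketch reproduces the arithmetic of this example. It assumes, as the text does, that the equating error is uncorrelated with measurement error. It also assumes the equating error is shared by all examinees who take the form. Averaging over examinees therefore reduces the score standard deviation and the standard error of measurement, but not the equating error. The function names are illustrative.

```python
import math


def effective_reliability(score_sd: float, sem: float, equating_sd: float) -> float:
    """Reliability after an independent equating error with SD equating_sd is added.

    Computes 1 - (SEM^2 + e^2) / (SD^2 + e^2): the equating error inflates both
    the error variance and the total variance of the reported score.
    """
    return 1.0 - (sem**2 + equating_sd**2) / (score_sd**2 + equating_sd**2)


def scenario(score_sd: float = 5.0, reliability: float = 0.84,
             equating_sd: float = 1.0, n: int = 1) -> float:
    """Effective reliability of the mean equated score of n randomly selected examinees."""
    sem = score_sd * math.sqrt(1.0 - reliability)  # 5 * 0.4 = 2 in the example
    # Averaging n independent examinees divides the score SD and the SEM by sqrt(n);
    # the equating error is common to everyone on the form, so it is not averaged away.
    return effective_reliability(score_sd / math.sqrt(n), sem / math.sqrt(n), equating_sd)


print(round(scenario(), 3))       # single examinee: 0.808
print(round(scenario(n=100), 3))  # mean of 100 examinees: 0.168
```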

The sensitivity of equating to the selection of anchor items is particularly important, because this source of error does not become negligible even when sample sizes are very large. As a consequence, it is of great importance that equating procedures be investigated for robustness to selection of anchor items. The approach in this report provides an appropriate method of investigation for both sampling errors and errors due to selection of anchor items. A general appreciation of the stability of equating results with respect to sample size and anchor selection requires a more comprehensive investigation of equating results from a substantial number of test administrations for a variety of testing programs. Such data can indicate the magnitude of variability commonly encountered and can suggest circumstances that lead to higher or lower variability.